linux-2.6

dect

Archived

Author	SHA1	Message	Date
Tejun Heo	29e2b09ab5	block: collapse blk_alloc_request() into get_request() Allocation failure handling in get_request() is about to be updated. To ease the update, collapse blk_alloc_request() into get_request(). This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:40 +02:00
Tejun Heo	f9fcc2d391	blkcg: collapse blkcg_policy_ops into blkcg_policy There's no reason to keep blkcg_policy_ops separate. Collapse it into blkcg_policy. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:17 +02:00
Tejun Heo	f95a04afa8	blkcg: embed struct blkg_policy_data in policy specific data Currently blkg_policy_data carries policy specific data as char flex array instead of being embedded in policy specific data. This was forced by oddities around blkg allocation which are all gone now. This patch makes blkg_policy_data embedded in policy specific data - throtl_grp and cfq_group so that it's more conventional and consistent with how io_cq is handled. * blkcg_policy->pdata_size is renamed to ->pd_size. * Functions which used to take void pdata now takes struct blkg_policy_data pd. * blkg_to_pdata/pdata_to_blkg() updated to blkg_to_pd/pd_to_blkg(). * Dummy struct blkg_policy_data definition added. Dummy pdata_to_blkg() definition was unused and inconsistent with the non-dummy version - correct dummy pd_to_blkg() added. * throtl and cfq updated accordingly. * As dummy blkg_to_pd/pd_to_blkg() are provided, blkg_to_cfqg/cfqg_to_blkg() don't need to be ifdef'd. Moved outside ifdef block. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:17 +02:00
Tejun Heo	3c798398e3	blkcg: mass rename of blkcg API During the recent blkcg cleanup, most of blkcg API has changed to such extent that mass renaming wouldn't cause any noticeable pain. Take the chance and cleanup the naming. * Rename blkio_cgroup to blkcg. * Drop blkio / blkiocg prefixes and consistently use blkcg. * Rename blkio_group to blkcg_gq, which is consistent with io_cq but keep the blkg prefix / variable name. * Rename policy method type and field names to signify they're dealing with policy data. * Rename blkio_policy_type to blkcg_policy. This patch doesn't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:17 +02:00
Tejun Heo	36558c8a30	blkcg: style cleanups for blk-cgroup.h * Update indentation on struct field declarations. * Uniformly don't use "extern" on function declarations. * Merge the two #ifdef CONFIG_BLK_CGROUP blocks. All changes in this patch are cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:16 +02:00
Tejun Heo	54e7ed12ba	blkcg: remove blkio_group->path[] blkio_group->path[] stores the path of the associated cgroup and is used only for debug messages. Just format the path from blkg->cgroup when printing debug messages. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:16 +02:00
Tejun Heo	c94bed8999	blkcg: blkg_rwstat_read() was missing inline blkg_rwstat_read() in blk-cgroup.h was missing inline modifier causing compile warning depending on configuration. Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:16 +02:00
Tejun Heo	6d18b008da	blkcg: shoot down blkgs if all policies are deactivated There's no reason to keep blkgs around if no policy is activated for the queue. This patch moves queue locking out of blkg_destroy_all() and call it from blkg_deactivate_policy() on deactivation of the last policy on the queue. This change was suggested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	3c96cb32d3	blkcg: drop stuff unused after per-queue policy activation update * All_q_list is unused. Drop all_q_{mutex\|list}. * @for_root of blkg_lookup_create() is always %false when called from outside blk-cgroup.c proper. Factor out __blkg_lookup_create() so that it doesn't check whether @q is bypassing and use the underscored version for the @for_root callsite. * blkg_destroy_all() is used only from blkcg proper and @destroy_root is always %true. Make it static and drop @destroy_root. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	a2b1693bac	blkcg: implement per-queue policy activation All blkcg policies were assumed to be enabled on all request_queues. Due to various implementation obstacles, during the recent blkcg core updates, this was temporarily implemented as shooting down all !root blkgs on elevator switch and policy [de]registration combined with half-broken in-place root blkg updates. In addition to being buggy and racy, this meant losing all blkcg configurations across those events. Now that blkcg is cleaned up enough, this patch replaces the temporary implementation with proper per-queue policy activation. Each blkcg policy should call the new blkcg_[de]activate_policy() to enable and disable the policy on a specific queue. blkcg_activate_policy() allocates and installs policy data for the policy for all existing blkgs. blkcg_deactivate_policy() does the reverse. If a policy is not enabled for a given queue, blkg printing / config functions skip the respective blkg for the queue. blkcg_activate_policy() also takes care of root blkg creation, and cfq_init_queue() and blk_throtl_init() are updated accordingly. This replaces blkcg_bypass_{start\|end}() and update_root_blkg_pd() unnecessary. Dropped. v2: cfq_init_queue() was returning uninitialized @ret on root_group alloc failure if !CONFIG_CFQ_GROUP_IOSCHED. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	03d8e11142	blkcg: add request_queue->root_blkg With per-queue policy activation, root blkg creation will be moved to blkcg core. Add q->root_blkg in preparation. For blk-throtl, this replaces throtl_data->root_tg; however, cfq needs to keep cfqd->root_group for !CONFIG_CFQ_GROUP_IOSCHED. This is to prepare for per-queue policy activation and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	b82d4b197c	blkcg: make request_queue bypassing on allocation With the previous change to guarantee bypass visiblity for RCU read lock regions, entering bypass mode involves non-trivial overhead and future changes are scheduled to make use of bypass mode during init path. Combined it may end up adding noticeable delay during boot. This patch makes request_queue start its life in bypass mode, which is ended on queue init completion at the end of blk_init_allocated_queue(), and updates blk_queue_bypass_start() such that draining and RCU synchronization are performed only when the queue actually enters bypass mode. This avoids unnecessarily switching in and out of bypass mode during init avoiding the overhead and any nasty surprises which may step from leaving bypass mode on half-initialized queues. The boot time overhead was pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	80fd99792b	blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing Currently, blkg_lookup() doesn't check @q bypass state. This patch updates blk_queue_bypass_start() to do synchronize_rcu() before returning and updates blkg_lookup() to check blk_queue_bypass() and return %NULL if bypassing. This ensures blkg_lookup() returns %NULL if @q is bypassing. This is to guarantee that nobody is accessing policy data while @q is bypassing, which is necessary to allow replacing blkio_cgroup->pd[] in place on policy [de]activation. v2: Added more comments explaining bypass guarantees as suggested by Vivek. v3: Added more comments explaining why there's no synchronize_rcu() in blk_cleanup_queue() as suggested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	da8b066262	blkcg: make blkg_conf_prep() take @pol and return with queue lock held Add @pol to blkg_conf_prep() and let it return with queue lock held (to be released by blkg_conf_finish()). Note that @pol isn't used yet. This is to prepare for per-queue policy activation and doesn't cause any visible difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	8bd435b30e	blkcg: remove static policy ID enums Remove BLKIO_POLICY_* enums and let blkio_policy_register() allocate @pol->plid dynamically on registration. The maximum number of blkcg policies which can be registered at the same time is defined by BLKCG_MAX_POLS constant added to include/linux/blkdev.h. Note that blkio_policy_register() now may fail. Policy init functions updated accordingly and unnecessary ifdefs removed from cfq_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	ec399347d3	blkcg: use @pol instead of @plid in update_root_blkg_pd() and blkcg_print_blkgs() The two functions were taking "enum blkio_policy_id plid". Make them take "const struct blkio_policy_type *pol" instead. This is to prepare for per-queue policy activation and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	bc0d6501a8	blkcg: kill blkio_list and replace blkio_list_lock with a mutex With blkio_policy[], blkio_list is redundant and hinders with per-queue policy activation. Remove it. Also, replace blkio_list_lock with a mutex blkcg_pol_mutex and let it protect the whole [un]registration. This is to prepare for per-queue policy activation and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	f48ec1d788	cfq: fix build breakage & warnings * CFQ_WEIGHT_* defined inside CONFIG_BLK_CGROUP causes cfq-iosched.c compile failure when the config is disabled. Move it outside the ifdef block. * Dummy cfqg_stats_*() definitions were lacking inline modifiers causing unused functions warning if !CONFIG_CFQ_GROUP_IOSCHED. Add them. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Linus Torvalds	d8dd0b6d48	Merge branch 'for-3.4/core' of git://git.kernel.dk/linux-block Pull block core bits from Jens Axboe: "It's a nice and quiet round this time, since most of the tricky stuff has been pushed to 3.5 to give it more time to mature. After a few hectic block IO core changes for 3.3 and 3.2, I'm quite happy with a slow round. Really minor stuff in here, the only real functional change is making the auto-unplug threshold a per-queue entity. The threshold is set so that it's low enough that we don't hold off IO for too long, but still big enough to get a nice benefit from the batched insert (and hence queue lock cost reduction). For raid configurations, this currently breaks down." * 'for-3.4/core' of git://git.kernel.dk/linux-block: block: make auto block plug flush threshold per-disk based Documentation: Add sysfs ABI change for cfq's target latency. block: Make cfq_target_latency tunable through sysfs. block: use lockdep_assert_held for queue locking block: blk_alloc_queue_node(): use caller's GFP flags instead of GFP_KERNEL	2012-04-13 18:07:19 -07:00
Shaohua Li	1b2e19f17e	block: make auto block plug flush threshold per-disk based We do auto block plug flush to reduce latency, the threshold is 16 requests. This works well if the task is accessing one or two drives. The problem is if the task is accessing a raid 0 device and the raid disk number is big, say 8 or 16, 16/8 = 2 or 16/16=1, we will have heavy lock contention. This patch makes the threshold per-disk based. The latency should be still ok accessing one or two drives. The setup with application accessing a lot of drives in the meantime uaually is big machine, avoiding lock contention is more important, because any contention will actually increase latency. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-06 11:37:47 -06:00
Tejun Heo	5bc4afb1ec	blkcg: drop BLKCG_STAT_{PRIV\|POL\|OFF} macros Now that all stat handling code lives in policy implementations, there's no need to encode policy ID in cft->private. * Export blkcg_prfill_[rw]stat() from blkcg, remove blkcg_print_[rw]stat(), and implement cfqg_print_[rw]stat() which use hard-code BLKIO_POLICY_PROP. * Use cft->private for offset of the target field directly and drop BLKCG_STAT_{PRIV\|POL\|OFF}(). Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:45 -07:00
Tejun Heo	d366e7ec41	blkcg: pass around pd->pdata instead of pd itself in prfill functions Now that all conf and stat fields are moved into policy specific blkio_policy_data->pdata areas, there's no reason to use blkio_policy_data itself in prfill functions. Pass around @pd->pdata instead of @pd. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	af133ceb26	blkcg: move blkio_group_conf->iops and ->bps to blk-throttle blkio_cgroup_conf->iops and ->bps are owned by blk-throttle and has no reason to be defined in blkcg core. Drop them and let conf setting functions directly manipulate throtl_grp->bps[] and ->iops[]. This makes blkio_group_conf empty. Drop it. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	3381cb8d2e	blkcg: move blkio_group_conf->weight to cfq blkio_group_conf->weight is owned by cfq and has no reason to be defined in blkcg core. Replace it with cfq_group->dev_weight and let conf setting functions directly set it. If dev_weight is zero, the cfqg doesn't have device specific weight configured. Also, rename BLKIO_WEIGHT_* constants to CFQ_WEIGHT_* and rename blkio_cgroup->weight to blkio_cgroup->cfq_weight. We eventually want per-policy storage in blkio_cgroup but just mark the ownership of the field for now. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	8a3d26151f	blkcg: move blkio_group_stats_cpu and friends to blk-throttle.c blkio_group_stats_cpu is used only by blk-throtl and has no reason to be defined in blkcg core. * Move blkio_group_stats_cpu to blk-throttle.c and rename it to tg_stats_cpu. * blkg_policy_data->stats_cpu is replaced with throtl_grp->stats_cpu. prfill functions updated accordingly. * All related macros / functions are renamed so that they have tg_ prefix and the unnecessary @pol arguments are dropped. * Per-cpu stats allocation code is also moved from blk-cgroup.c to blk-throttle.c and gets simplified to only deal with BLKIO_POLICY_THROTL. percpu stat free is performed by the exit method throtl_exit_blkio_group(). * throtl_reset_group_stats() implemented for blkio_reset_group_stats_fn method so that tg->stats_cpu can be reset. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	155fead9b6	blkcg: move blkio_group_stats to cfq-iosched.c blkio_group_stats contains only fields used by cfq and has no reason to be defined in blkcg core. * Move blkio_group_stats to cfq-iosched.c and rename it to cfqg_stats. * blkg_policy_data->stats is replaced with cfq_group->stats. blkg_prfill_[rw]stat() are updated to use offset against pd->pdata instead. * All related macros / functions are renamed so that they have cfqg_ prefix and the unnecessary @pol arguments are dropped. * All stat functions now take cfq_group * instead of blkio_group . lockdep assertion on queue lock dropped. Elevator runs under queue lock by default. There isn't much to be gained by adding lockdep assertions at stat function level. * cfqg_stats_reset() implemented for blkio_reset_group_stats_fn method so that cfqg->stats can be reset. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	9ade5ea4ce	blkcg: add blkio_policy_ops operations for exit and stat reset Add blkio_policy_ops->blkio_exit_group_fn() and ->blkio_reset_group_stats_fn(). These will be used to further modularize blkcg policy implementation. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	41b38b6d54	blkcg: cfq doesn't need per-cpu dispatch stats blkio_group_stats_cpu is used to count dispatch stats using per-cpu counters. This is used by both blk-throtl and cfq-iosched but the sharing is rather silly. * cfq-iosched doesn't need per-cpu dispatch stats. cfq always updates those stats while holding queue_lock. * blk-throtl needs per-cpu dispatch stats but only service_bytes and serviced. It doesn't make use of sectors. This patch makes cfq add and use global stats for service_bytes, serviced and sectors, removes per-cpu sectors counter and moves per-cpu stat printing code to blk-throttle.c. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	629ed0b102	blkcg: move statistics update code to policies As with conf/stats file handling code, there's no reason for stat update code to live in blkcg core with policies calling into update them. The current organization is both inflexible and complex. This patch moves stat update code to specific policies. All blkiocg_update__stats() functions which deal with BLKIO_POLICY_PROP stats are collapsed into their cfq_blkiocg_update__stats() counterparts. blkiocg_update_dispatch_stats() is used by both policies and duplicated as throtl_update_dispatch_stats() and cfq_blkiocg_update_dispatch_stats(). This will be cleaned up later. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	2ce4d50f9c	cfq: collapse cfq.h into cfq-iosched.c block/cfq.h contains some functions which interact with blkcg; however, this is only part of it and cfq-iosched.c already has quite some #ifdef CONFIG_CFQ_GROUP_IOSCHED. With conf/stat handling being moved to specific policies, having these relay functions isolated in cfq.h doesn't make much sense. Collapse cfq.h into cfq-iosched.c for now. Let's split blkcg support properly later if necessary. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:43 -07:00
Tejun Heo	60c2bc2d5a	blkcg: move conf/stat file handling code to policies blkcg conf/stat handling is convoluted in that details which belong to specific policy implementations are all out in blkcg core and then policies hook into core layer to access and manipulate confs and stats. This sadly achieves both inflexibility (confs/stats can't be modified without messing with blkcg core) and complexity (all the call-ins and call-backs). The previous patches restructured conf and stat handling code such that they can be separated out. This patch relocates the file handling part. All conf/stat file handling code which belongs to BLKIO_POLICY_PROP is moved to cfq-iosched.c and all BKLIO_POLICY_THROTL code to blk-throtl.c. The move is verbatim except for blkio_update_group_{weight\|bps\|iops}() callbacks which relays conf changes to policies. The configuration settings are handled in policies themselves so the relaying isn't necessary. Conf setting functions are modified to directly call per-policy update functions and the relaying mechanism is dropped. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:43 -07:00
Tejun Heo	44ea53de46	blkcg: implement blkio_policy_type->cftypes Add blkiop->cftypes which is added and removed together with the policy. This will be used to move conf/stat handling to the policies. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:43 -07:00
Tejun Heo	829fdb5000	blkcg: export conf/stat helpers to prepare for reorganization conf/stat handling is about to be moved to policy implementation from blkcg core. Export conf/stat helpers from blkcg core so that blk-throttle and cfq-iosched can use them. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:43 -07:00
Tejun Heo	726fa6945e	blkcg: simplify blkg_conf_prep() blkg_conf_prep() implements "MAJ:MIN VAL" parsing manually, which is unnecessary. Just use sscanf("%u:%u %llu"). This might not reject some malformed input (extra input at the end) but we don't care. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:43 -07:00
Tejun Heo	3a8b31d396	blkcg: restructure blkio_group configruation setting As part of userland interface restructuring, this patch updates per-blkio_group configuration setting. Instead of funneling everything through a master function which has hard-coded cases for each config file it may handle, the common part is factored into blkg_conf_prep() and blkg_conf_finish() and different configuration setters are implemented using the helpers. While this doesn't result in immediate LOC reduction, this enables further cleanups and more modular implementation. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:43 -07:00
Tejun Heo	c4682aec9c	blkcg: restructure configuration printing Similarly to the previous stat restructuring, this patch restructures conf printing code such that, * Conf printing uses the same helpers as stat. * Printing function doesn't require hardcoded switching on the config being printed. Note that this isn't complete yet for throttle confs. The next patch will convert setting for these confs and will complete the transition. * Printing uses read_seq_string callback (other methods will be phased out). Note that blkio_group_conf.iops[2] is changed to u64 so that they can be manipulated with the same functions. This is transitional and will go away later. After this patch, per-device configurations - weight, bps and iops - use __blkg_prfill_u64() for printing which uses white space as delimiter instead of tab. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:43 -07:00
Tejun Heo	627f29f481	blkcg: drop blkiocg_file_write_u64() blkiocg_file_write_u64() has single switch case. Drop blkiocg_file_write_u64(), rename blkio_weight_write() to blkcg_set_weight() and use it directly for .write_u64 callback. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:43 -07:00
Tejun Heo	d3d32e69fa	blkcg: restructure statistics printing blkcg stats handling is a mess. None of the stats has much to do with blkcg core but they are all implemented in blkcg core. Code sharing is achieved by mixing common code with hard-coded cases for each stat counter. This patch restructures statistics printing such that * Common logic exists as helper functions and specific print functions use the helpers to implement specific cases. * Printing functions serving multiple counters don't require hardcoded switching on specific counters. * Printing uses read_seq_string callback (other methods will be phased out). This change enables further cleanups and relocating stats code to the policy implementation it belongs to. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:42 -07:00
Tejun Heo	edcb0722c6	blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ\|WRITE\|SYNC\|ASYNC\|TOTAL} renamed to BLKG_RWSTAT_{READ\|WRITE\|SYNC\|ASYNC\|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's way better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START\|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:42 -07:00
Tejun Heo	2aa4a1523b	blkcg: BLKIO_STAT_CPU_SECTORS doesn't have subcounters BLKIO_STAT_CPU_SECTORS doesn't need read/write/sync/async subcounters and is counted by blkio_group_stats_cpu->sectors; however, it still holds a member in blkio_group_stats_cpu->stat_arr_cpu. Rearrange stat_type_cpu and define BLKIO_STAT_CPU_ARR_NR and use it for stat_arr_cpu[] size so that only SERVICE_BYTES and SERVICED have subcounters. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:42 -07:00
Tejun Heo	aaec55a002	blkcg: remove unused @pol and @plid parameters @pol to blkg_to_pdata() and @plid to blkg_lookup_create() are no longer necessary. Drop them. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:42 -07:00
Tao Ma	5bf14c0727	block: Make cfq_target_latency tunable through sysfs. In cfq, when we calculate a time slice for a process(or a cfqq to be precise), we have to consider the cfq_target_latency so that all the sync request have an estimated latency(300ms) and it is controlled by cfq_target_latency. But in some hadoop test, we have found that if there are many processes doing sequential read(24 for example), the throughput is bad because every process can only work for about 25ms and the cfqq is switched. That leads to a higher disk seek. We can achive the good throughput by setting low_latency=0, but then some read's latency is too much for the application. So this patch makes cfq_target_latency tunable through sysfs so that we can tune it and find some magic number which is not bad for both the throughput and the read latency. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-01 14:33:39 -07:00
Tejun Heo	959d851caa	Merge branch 'for-3.5' of ../cgroup into block/for-3.5/core-merged cgroup/for-3.5 contains the following changes which blk-cgroup needs to proceed with the on-going cleanup. * Dynamic addition and removal of cftypes to make config/stat file handling modular for policies. * cgroup removal update to not wait for css references to drain to fix blkcg removal hang caused by cfq caching cfqgs. Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the following conflicts in block/blk-cgroup.c. * `761b3ef50e` "cgroup: remove cgroup_subsys argument from callbacks" conflicts with blkiocg_pre_destroy() addition and blkiocg_attach() removal. Resolved by removing @subsys from all subsys methods. * `676f7c8f84` "cgroup: relocate cftype and cgroup_subsys definitions in controllers" conflicts with ->pre_destroy() and ->attach() updates and removal of modular config. Resolved by dropping forward declarations of the methods and applying updates to the relocated blkio_subsys. * `4baf6e3325` "cgroup: convert all non-memcg controllers to the new cftype interface" builds upon the previous item. Resolved by adding ->base_cftypes to the relocated blkio_subsys. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 12:55:00 -07:00
Tejun Heo	4baf6e3325	cgroup: convert all non-memcg controllers to the new cftype interface Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio, net_cls and device controllers to use the new cftype based interface. Termination entry is added to cftype arrays and populate callbacks are replaced with cgroup_subsys->base_cftypes initializations. This is functionally identical transformation. There shouldn't be any visible behavior change. memcg is rather special and will be converted separately. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Vivek Goyal <vgoyal@redhat.com>	2012-04-01 12:09:55 -07:00
Tejun Heo	676f7c8f84	cgroup: relocate cftype and cgroup_subsys definitions in controllers blk-cgroup, netprio_cgroup, cls_cgroup and tcp_memcontrol unnecessarily define cftype array and cgroup_subsys structures at the top of the file, which is unconventional and necessiates forward declaration of methods. This patch relocates those below the definitions of the methods and removes the forward declarations. Note that forward declaration of tcp_files[] is added in tcp_memcontrol.c for tcp_init_cgroup(). This will be removed soon by another patch. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizf@cn.fujitsu.com>	2012-04-01 12:09:55 -07:00
Andi Kleen	8bcb6c7d48	block: use lockdep_assert_held for queue locking Instead of an ugly open coded variant. Cc: axboe@kernel.dk Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-30 12:33:28 +02:00
Dan Carpenter	a5567932fc	blkcg: change a spin_lock() to spin_lock_irq() Smatch complains that we re-enable IRQs twice. It looks like we forgot to disable them here on the spin_trylock() failure path. This was added in `9f13ef678e` "blkcg: use double locking instead of RCU for blkg synchronization". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>` Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-29 20:57:08 +02:00
Tejun Heo	eb7d8c07f9	cfq: fix cfqg ref handling when BLK_CGROUP && !CFQ_GROUP_IOSCHED When BLK_CGROUP is enabled but CFQ_GROUP_IOSCHED is, cfq ends up calling blkg_get/put() on dummy cfqg leading to the following crash. BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0 IP: [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Modules linked in: Pid: 1, comm: swapper/0 Not tainted 3.3.0-rc6-work+ #125 Bochs Bochs RIP: 0010:[<ffffffff813d44d8>] [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 RSP: 0018:ffff88001f9dfd80 EFLAGS: 00010046 RAX: ffff88001aefbbf0 RBX: ffff88001aeedbf0 RCX: 0000000000000100 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff820ffd40 RBP: ffff88001f9dfdd0 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000009 R14: ffff88001aefbc30 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000000000b0 CR3: 000000000206f000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper/0 (pid: 1, threadinfo ffff88001f9de000, task ffff88001f9dc040) Stack: ffff88001aeedbf0 ffff88001aefbdb0 ffff88001aef1548 ffff88001aefbbf0 ffff88001f9dfdd0 ffff88001aef1548 ffffffff820d6320 ffffffff8165ce30 ffffffff82c555e0 ffff88001aeebbf0 ffff88001f9dfe00 ffffffff813b0507 Call Trace: [<ffffffff813b0507>] elevator_init+0xd7/0x140 [<ffffffff813b83d5>] blk_init_allocated_queue+0x125/0x150 [<ffffffff813b94d3>] blk_init_queue_node+0x43/0x80 [<ffffffff813b9523>] blk_init_queue+0x13/0x20 [<ffffffff821aec00>] floppy_init+0x82/0xec7 [<ffffffff810001d2>] do_one_initcall+0x42/0x170 [<ffffffff821835fc>] kernel_init+0xcb/0x14f [<ffffffff81b40b24>] kernel_thread_helper+0x4/0x10 Code: 00 e8 1d 9e 76 00 48 8b 43 48 48 85 c0 48 89 83 28 03 00 00 74 07 4c 8b a0 10 ff ff ff 8b 15 b0 2e d0 00 85 d2 0f 85 49 01 00 00 <41> 8b 84 24 b0 00 00 00 85 c0 0f 8e 8c 01 00 00 83 e8 01 85 c0 RIP [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 Because cfq's blkcg support has a on/off switch, CFQ_GROUP_IOSCHED, separate from BLK_CGROUP, blkg access through cfqg needs to be conditioned on it. * Make blkg_to_cfqg() and cfqg_to_blkg() conditioned on CFQ_GROUP_IOSCHED. If disabled, they always return %NULL. * Introduce cfqg_get() and cfqg_put() conditioned on CFQ_GROUP_IOSCHED. If disabled, they are noops. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-23 14:02:53 +01:00
Dan Carpenter	00380a404f	block: blk_alloc_queue_node(): use caller's GFP flags instead of GFP_KERNEL We should use the GFP flags that the caller specified instead of picking our own. All the callers specify GFP_KERNEL so this doesn't make a difference to how the kernel runs, it's just a cleanup. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-23 09:58:54 +01:00
Linus Torvalds	0d9cabdcce	Merge branch 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Out of the 8 commits, one fixes a long-standing locking issue around tasklist walking and others are cleanups." * 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list cgroup: Remove wrong comment on cgroup_enable_task_cg_list() cgroup: remove cgroup_subsys argument from callbacks cgroup: remove extra calls to find_existing_css_set cgroup: replace tasklist_lock with rcu_read_lock cgroup: simplify double-check locking in cgroup_attach_proc cgroup: move struct cgroup_pidlist out from the header file cgroup: remove cgroup_attach_task_current_cg()	2012-03-20 18:11:21 -07:00
Linus Torvalds	2ba68940c8	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes for v3.4 from Ingo Molnar * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) printk: Make it compile with !CONFIG_PRINTK sched/x86: Fix overflow in cyc2ns_offset sched: Fix nohz load accounting -- again! sched: Update yield() docs printk/sched: Introduce special printk_sched() for those awkward moments sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer sched: Cleanup cpu_active madness sched: Fix load-balance wreckage sched: Clean up parameter passing of proc_sched_autogroup_set_nice() sched: Ditch per cgroup task lists for load-balancing sched: Rename load-balancing fields sched: Move load-balancing arguments into helper struct sched/rt: Do not submit new work when PI-blocked sched/rt: Prevent idle task boosting sched/wait: Add __wake_up_all_locked() API sched/rt: Document scheduler related skip-resched-check sites sched/rt: Use schedule_preempt_disabled() sched/rt: Add schedule_preempt_disabled() sched/rt: Do not throttle when PI boosting sched/rt: Keep period timer ticking when rt throttling is active ...	2012-03-20 10:31:44 -07:00
Tejun Heo	2b566fa55b	block: remove ioc_*_changed() After the previous patch to cfq, there's no ioc_get_changed() user left. This patch yanks out ioc_{ioprio\|cgroup\|get}_changed() and all related stuff. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-20 12:47:48 +01:00
Tejun Heo	598971bfbd	cfq: don't use icq_get_changed() cfq caches the associated cfqq's for a given cic. The cache needs to be flushed if the cic's ioprio or blkcg has changed. It is currently done by requiring the changing action to set the respective ICQ__CHANGED bit in the icq and testing it from cfq_set_request(), which involves iterating through all the affected icqs. All cfq wants to know is whether ioprio and/or blkcg have changed since the last flush and can be easily achieved by just remembering the current ioprio and blkcg ID in cic. This patch adds cic->{ioprio\|blkcg_id}, updates all ioprio users to use the remembered value instead, and updates cfq_set_request() path such that, instead of using icq_get_changed(), the current values are compared against the remembered ones and trigger appropriate flush action if not. Condition tests are moved inside both _changed functions which are now named check_ioprio_changed() and check_blkcg_changed(). ioprio.h::task_ioprio() can't be used anymore and replaced with open-coded IOPRIO_CLASS_NONE case in cfq_async_queue_prio(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-20 12:47:47 +01:00
Tejun Heo	abede6da27	cfq: pass around cfq_io_cq instead of io_context Now that io_cq is managed by block core and guaranteed to exist for any in-flight request, it is easier and carries more information to pass around cfq_io_cq than io_context. This patch updates cfq_init_prio_data(), cfq_find_alloc_queue() and cfq_get_queue() to take @cic instead of @ioc. This change removes a duplicate cfq_cic_lookup() from cfq_find_alloc_queue(). This change enables the use of cic-cached ioprio in the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-20 12:47:47 +01:00
Tejun Heo	9a9e8a26da	blkcg: add blkcg->id Add 64bit unique id to blkcg. This will be used by policies which want blkcg identity test to tell whether the associated blkcg has changed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-20 12:47:47 +01:00
Tejun Heo	edf1b879e3	blkcg: remove blkio_group->stats_lock With recent plug merge updates, all non-percpu stat updates happen under queue_lock making stats_lock unnecessary to synchronize stat updates. The only synchronization necessary is stat reading, which can be done using u64_stats_sync instead. This patch removes blkio_group->stats_lock and adds blkio_group_stats->syncp for reader synchronization. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-20 12:45:37 +01:00
Tejun Heo	c4c76a0538	blkcg: restructure blkio_get_stat() Restructure blkio_get_stat() to prepare for removal of stats_lock. * Define BLKIO_STAT_ARR_NR explicitly to denote which stats have subtypes instead of using BLKIO_STAT_QUEUED. * Separate out stat acquisition and printing. After this, there are only two users of blkio_fill_stat(). Just open code it. * The code was mixing MAX_KEY_LEN and MAX_KEY_LEN - 1. There's no need to subtract one. Use MAX_KEY_LEN consistently. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-20 12:45:37 +01:00
Tejun Heo	997a026c80	blkcg: simplify stat reset blkiocg_reset_stats() implements stat reset for blkio.reset_stats cgroupfs file. This feature is very unconventional and something which shouldn't have been merged. It's only useful when there's only one user or tool looking at the stats. As soon as multiple users and/or tools are involved, it becomes useless as resetting disrupts other usages. There are very good reasons why all other stats expect readers to read values at the start and end of a period and subtract to determine delta over the period. The implementation is rather complex - some fields shouldn't be cleared and it saves some fields, resets whole and restores for some reason. Reset of percpu stats is also racy. The comment points to 64bit store atomicity for the reason but even without that stores for zero can simply race with other CPUs doing RMW and get clobbered. Simplify reset by * Clear selectively instead of resetting and restoring. * Grouping debug stat fields to be reset and using memset() over them. * Not caring about stats_lock. * Using memset() to reset percpu stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-20 12:45:37 +01:00
Tejun Heo	5fe224d2d5	blkcg: don't use percpu for merged stats With recent plug merge updates, merged stats are no longer called for plug merges and now only updated while holding queue_lock. As stats_lock is scheduled to be removed, there's no reason to use percpu for merged stats. Don't use percpu for merged stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-20 12:45:37 +01:00
Vivek Goyal	1cd9e039fc	blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-20 12:45:37 +01:00
Linus Torvalds	f1cbd03f5e	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Been sitting on this for a while, but lets get this out the door. This fixes various important bugs for 3.3 final, along with a few more trivial ones. Please pull!" * 'for-linus' of git://git.kernel.dk/linux-block: block: fix ioc leak in put_io_context block, sx8: fix pointer math issue getting fw version Block: use a freezable workqueue for disk-event polling drivers/block/DAC960: fix -Wuninitialized warning drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning block: fix __blkdev_get and add_disk race condition block: Fix setting bio flags in drivers (sd_dif/floppy) block: Fix NULL pointer dereference in sd_revalidate_disk block: exit_io_context() should call elevator_exit_icq_fn() block: simplify ioc_release_fn() block: replace icq->changed with icq->flags	2012-03-14 17:16:45 -07:00
Xiaotian Feng	ff8c1474cc	block: fix ioc leak in put_io_context When put_io_context is called, if ioc->icq_list is empty and refcount is 1, kernel will not free the ioc. This is caught by following kmemleak: unreferenced object 0xffff880036349fe0 (size 216): comm "sh", pid 2137, jiffies 4294931140 (age 290579.412s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 01 00 01 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... backtrace: [<ffffffff8169f926>] kmemleak_alloc+0x26/0x50 [<ffffffff81195a9c>] kmem_cache_alloc_node+0x1cc/0x2a0 [<ffffffff81356b67>] create_io_context_slowpath+0x27/0x130 [<ffffffff81356d2b>] get_task_io_context+0xbb/0xf0 [<ffffffff81055f0e>] copy_process+0x188e/0x18b0 [<ffffffff8105609b>] do_fork+0x11b/0x420 [<ffffffff810247f8>] sys_clone+0x28/0x30 [<ffffffff816d3373>] stub_clone+0x13/0x20 [<ffffffffffffffff>] 0xffffffffffffffff ioc should be freed if ioc->icq_list is empty. Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-14 15:34:48 +01:00
Tejun Heo	671058fb2a	block: make blk-throttle preserve the issuing task on delayed bios Make blk-throttle call bio_associate_current() on bios being delayed such that they get issued to block layer with the original io_context. This allows stacking blk-throttle and cfq-iosched propio policies. bios will always be issued with the correct ioc and blkcg whether it gets delayed by blk-throttle or not. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	4f85cb96d9	block: make block cgroup policies follow bio task association Implement bio_blkio_cgroup() which returns the blkcg associated with the bio if exists or %current's blkcg, and use it in blk-throttle and cfq-iosched propio. This makes both cgroup policies honor task association for the bio instead of always assuming %current. As nobody is using bio_set_task() yet, this doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	852c788f83	block: implement bio_associate_current() IO scheduling and cgroup are tied to the issuing task via io_context and cgroup of %current. Unfortunately, there are cases where IOs need to be routed via a different task which makes scheduling and cgroup limit enforcement applied completely incorrectly. For example, all bios delayed by blk-throttle end up being issued by a delayed work item and get assigned the io_context of the worker task which happens to serve the work item and dumped to the default block cgroup. This is double confusing as bios which aren't delayed end up in the correct cgroup and makes using blk-throttle and cfq propio together impossible. Any code which punts IO issuing to another task is affected which is getting more and more common (e.g. btrfs). As both io_context and cgroup are firmly tied to task including userland visible APIs to manipulate them, it makes a lot of sense to match up tasks to bios. This patch implements bio_associate_current() which associates the specified bio with %current. The bio will record the associated ioc and blkcg at that point and block layer will use the recorded ones regardless of which task actually ends up issuing the bio. bio release puts the associated ioc and blkcg. It grabs and remembers ioc and blkcg instead of the task itself because task may already be dead by the time the bio is issued making ioc and blkcg inaccessible and those are all block layer cares about. elevator_set_req_fn() is updated such that the bio elvdata is being allocated for is available to the elevator. This doesn't update block cgroup policies yet. Further patches will implement the support. -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in rq_ioc() to fix build breakage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Kent Overstreet <koverstreet@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	f6e8d01bee	block: add io_context->active_ref Currently ioc->nr_tasks is used to decide two things - whether an ioc is done issuing IOs and whether it's shared by multiple tasks. This patch separate out the first into ioc->active_ref, which is acquired and released using {get\|put}_io_context_active() respectively. This will be used to associate bio's with a given task. This patch doesn't introduce any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	24acfc34fb	block: interface update for ioc/icq creation functions Make the following interface updates to prepare for future ioc related changes. * create_io_context() returning ioc only works for %current because it doesn't increment ref on the ioc. Drop @task parameter from it and always assume %current. * Make create_io_context_slowpath() return 0 or -errno and rename it to create_task_io_context(). * Make ioc_create_icq() take @ioc as parameter instead of assuming that of %current. The caller, get_request(), is updated to create ioc explicitly and then pass it into ioc_create_icq(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	b679281a64	block: restructure get_request() get_request() is structured a bit unusually in that failure path is inlined in the usual flow with goto labels atop and inside it. Relocate the error path to the end of the function. This is to prepare for icq handling changes in get_request() and doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	c875f4d025	blkcg: drop unnecessary RCU locking Now that blkg additions / removals are always done under both q and blkcg locks, the only places RCU locking is necessary are blkg_lookup[_create]() for lookup w/o blkcg lock. This patch drops unncessary RCU locking replacing it with plain blkcg locking as necessary. * blkiocg_pre_destroy() already perform proper locking and don't need RCU. Dropped. * blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read lock. This isn't a hot path. * Now unnecessary synchronize_rcu() from queue exit paths removed. This makes q->nr_blkgs unnecessary. Dropped. * RCU annotation on blkg->q removed. -v2: Vivek pointed out that blkg_lookup_create() still needs to be called under rcu_read_lock(). Updated. -v3: After the update, stats_lock locking in blkio_read_blkg_stats() shouldn't be using _irq variant as it otherwise ends up enabling irq while blkcg->lock is locked. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	9f13ef678e	blkcg: use double locking instead of RCU for blkg synchronization blkgs are chained from both blkcgs and request_queues and thus subjected to two locks - blkcg->lock and q->queue_lock. As both blkcg and q can go away anytime, locking during removal is tricky. It's currently solved by wrapping removal inside RCU, which makes the synchronization complex. There are three locks to worry about - the outer RCU, q lock and blkcg lock, and it leads to nasty subtle complications like conditional synchronize_rcu() on queue exit paths. For all other paths, blkcg lock is naturally nested inside q lock and the only exception is blkcg removal path, which is a very cold path and can be implemented as clumsy but conceptually-simple reverse double lock dancing. This patch updates blkg removal path such that blkgs are removed while holding both q and blkcg locks, which is trivial for request queue exit path - blkg_destroy_all(). The blkcg removal path, blkiocg_pre_destroy(), implements reverse double lock dancing essentially identical to ioc_release_fn(). This simplifies blkg locking - no half-dead blkgs to worry about. Now unnecessary RCU annotations will be removed by the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	e8989fae38	blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	03aa264ac1	blkcg: let blkcg core manage per-queue blkg list and counter With the previous patch to move blkg list heads and counters to request_queue and blkg, logic to manage them in both policies are almost identical and can be moved to blkcg core. This patch moves blkg link logic into blkg_lookup_create(), implements common blkg unlink code in blkg_destroy(), and updates blkg_destory_all() so that it's policy specific and can skip root group. The updated blkg_destroy_all() is now used to both clear queue for bypassing and elv switching, and release all blkgs on q exit. This patch introduces a race window where policy [de]registration may race against queue blkg clearing. This can only be a problem on cfq unload and shouldn't be a real problem in practice (and we have many other places where this race already exists). Future patches will remove these unlikely races. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	4eef304998	blkcg: move per-queue blkg list heads and counters to queue and blkg Currently, specific policy implementations are responsible for maintaining list and number of blkgs. This duplicates code unnecessarily, and hinders factoring common code and providing blkcg API with better defined semantics. After this patch, request_queue hosts list heads and counters and blkg has list nodes for both policies. This patch only relocates the necessary fields and the next patch will actually move management code into blkcg core. Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to have 2 elements. This is to avoid include dependency and will be removed by the next patch. This patch doesn't introduce any behavior change. -v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed as pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	c1768268f9	blkcg: don't use blkg->plid in stat related functions blkg is scheduled to be unified for all policies and thus there won't be one-to-one mapping from blkg to policy. Update stat related functions to take explicit @pol or @plid arguments and not use blkg->plid. This is painful for now but most of specific stat interface functions will be replaced with a handful of generic helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	549d3aa872	blkcg: make blkg->pd an array and move configuration and stats into it To prepare for unifying blkgs for different policies, make blkg->pd an array with BLKIO_NR_POLICIES elements and move blkg->conf, ->stats, and ->stats_cpu into blkg_policy_data. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	1adaf3dde3	blkcg: move refcnt to blkcg core Currently, blkcg policy implementations manage blkg refcnt duplicating mostly identical code in both policies. This patch moves refcnt to blkg and let blkcg core handle refcnt and freeing of blkgs. * cfq blkgs now also get freed via RCU. * cfq blkgs lose RB_EMPTY_ROOT() sanity check on blkg free. If necessary, we can add blkio_exit_group_fn() to resurrect this. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	0381411e4b	blkcg: let blkcg core handle policy private data allocation Currently, blkg's are embedded in private data blkcg policy private data structure and thus allocated and freed by policies. This leads to duplicate codes in policies, hinders implementing common part in blkcg core with strong semantics, and forces duplicate blkg's for the same cgroup-q association. This patch introduces struct blkg_policy_data which is a separate data structure chained from blkg. Policies specifies the amount of private data it needs in its blkio_policy_type->pdata_size and blkcg core takes care of allocating them along with blkg which can be accessed using blkg_to_pdata(). blkg can be determined from pdata using pdata_to_blkg(). blkio_alloc_group_fn() method is accordingly updated to blkio_init_group_fn(). For consistency, tg_of_blkg() and cfqg_of_blkg() are replaced with blkg_to_tg() and blkg_to_cfqg() respectively, and functions to map in the reverse direction are added. Except that policy specific data now lives in a separate data structure from blkg, this patch doesn't introduce any functional difference. This will be used to unify blkg's for different policies. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	923adde1be	blkcg: clear all request_queues on blkcg policy [un]registrations Keep track of all request_queues which have blkcg initialized and turn on bypass and invoke blkcg_clear_queue() on all before making changes to blkcg policies. This is to prepare for moving blkg management into blkcg core. Note that this uses more brute force than necessary. Finer grained shoot down will be implemented later and given that policy [un]registration almost never happens on running systems (blk-throtl can't be built as a module and cfq usually is the builtin default iosched), this shouldn't be a problem for the time being. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	5efd611351	blkcg: add blkcg_{init\|drain\|exit}_queue() Currently block core calls directly into blk-throttle for init, drain and exit. This patch adds blkcg_{init\|drain\|exit}_queue() which wraps the blk-throttle functions. This is to give more control and visiblity to blkcg core layer for proper layering. Further patches will add logic common to blkcg policies to the functions. While at it, collapse blk_throtl_release() into blk_throtl_exit(). There's no reason to keep them separate. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	7ee9c56205	blkcg: let blkio_group point to blkio_cgroup directly Currently, blkg points to the associated blkcg via its css_id. This unnecessarily complicates dereferencing blkcg. Let blkg hold a reference to the associated blkcg and point directly to it and disable css_id on blkio_subsys. This change requires splitting blkiocg_destroy() into blkiocg_pre_destroy() and blkiocg_destroy() so that all blkg's can be destroyed and all the blkcg references held by them dropped during cgroup removal. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Vivek Goyal	92616b5b3a	blkcg: skip blkg printing if q isn't associated with disk blk-cgroup printing code currently assumes that there is a device/disk associated with every queue in the system, but modules like floppy, can instantiate request queues without registering disk which can lead to oops. Skip the queue/blkg which don't have dev/disk associated with them. -tj: Factored out backing_dev_info check into blkg_dev_name(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	7a4dd281ec	blkcg: kill the mind-bending blkg->dev blkg->dev is dev_t recording the device number of the block device for the associated request_queue. It is used to identify the associated block device when printing out configuration or stats. This is redundant to begin with. A blkg is an association between a cgroup and a request_queue and it of course is possible to reach request_queue from blkg and synchronization conventions are in place for safe q dereferencing, so this shouldn't be necessary from the beginning. Furthermore, it's initialized by sscanf()ing the device name of backing_dev_info. The mind boggles. Anyways, if blkg is visible under rcu lock, we know that the associated request_queue hasn't gone away yet and its bdi is registered and alive - blkg can't be created for request_queue which hasn't been fully initialized and it can't go away before blkg is removed. Let stat and conf read functions get device name from blkg->q->backing_dev_info.dev and pass it down to printing functions and remove blkg->dev. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	4bfd482e73	blkcg: kill blkio_policy_node Now that blkcg configuration lives in blkg's, blkio_policy_node is no longer necessary. Kill it. blkio_policy_parse_and_set() now fails if invoked for missing device and functions to print out configurations are updated to print from blkg's. cftype_blkg_same_policy() is dropped along with other policy functions for consistency. Its one line is open coded in the only user - blkio_read_blkg_stats(). -v2: Update to reflect the retry-on-bypass logic change of the previous patch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	e56da7e287	blkcg: don't allow or retain configuration of missing devices blkcg is very peculiar in that it allows setting and remembering configurations for non-existent devices by maintaining separate data structures for configuration. This behavior is completely out of the usual norms and outright confusing; furthermore, it uses dev_t number to match the configuration to devices, which is unpredictable to begin with and becomes completely unuseable if EXT_DEVT is fully used. It is wholely unnecessary - we already have fully functional userland mechanism to program devices being hotplugged which has full access to device identification, connection topology and filesystem information. Add a new struct blkio_group_conf which contains all blkcg configurations to blkio_group and let blkio_group, which can be created iff the associated device exists and is removed when the associated device goes away, carry all configurations. Note that, after this patch, all newly created blkg's will always have the default configuration (unlimited for throttling and blkcg's weight for propio). This patch makes blkio_policy_node meaningless but doesn't remove it. The next patch will. -v2: Updated to retry after short sleep if blkg lookup/creation failed due to the queue being temporarily bypassed as indicated by -EBUSY return. Pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	cd1604fab4	blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	f51b802c17	blkcg: use the usual get blkg path for root blkio_group For root blkg, blk_throtl_init() was using throtl_alloc_tg() explicitly and cfq_init_queue() was manually initializing embedded cfqd->root_group, adding unnecessarily different code paths to blkg handling. Make both use the usual blkio_group get functions - throtl_get_tg() and cfq_get_cfqg() - for the root blkio_group too. Note that blk_throtl_init() callsite is pushed downwards in blk_alloc_queue_node() so that @q is sufficiently initialized for throtl_get_tg(). This simplifies root blkg handling noticeably for cfq and will allow further modularization of blkcg API. -v2: Vivek pointed out that using cfq_get_cfqg() won't work if CONFIG_CFQ_GROUP_IOSCHED is disabled. Fix it by factoring out initialization of base part of cfqg into cfq_init_cfqg_base() and alloc/init/free explicitly if !CONFIG_CFQ_GROUP_IOSCHED. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	035d10b2fa	blkcg: add blkio_policy[] array and allow one policy per policy ID Block cgroup policies are maintained in a linked list and, theoretically, multiple policies sharing the same policy ID are allowed. This patch temporarily restricts one policy per plid and adds blkio_policy[] array which indexes registered policy types by plid. Both the restriction and blkio_policy[] array are transitional and will be removed once API cleanup is complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	ca32aefc7f	blkcg: use q and plid instead of opaque void * for blkio_group association blkgio_group is association between a block cgroup and a queue for a given policy. Using opaque void * for association makes things confusing and hinders factoring of common code. Use request_queue * and, if necessary, policy id instead. This will help block cgroup API cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	0a5a7d0e32	blkcg: update blkg get functions take blkio_cgroup as parameter In both blkg get functions - throtl_get_tg() and cfq_get_cfqg(), instead of obtaining blkcg of %current explicitly, let the caller specify the blkcg to use as parameter and make both functions hold on to the blkcg. This is part of block cgroup interface cleanup and will help making blkcg API more modular. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	2a7f124414	blkcg: move rcu_read_lock() outside of blkio_group get functions rcu_read_lock() in throtl_get_tb() and cfq_get_cfqg() holds onto @blkcg while looking up blkg. For API cleanup, the next patch will make the caller responsible for determining @blkcg to look blkg from and let them specify it as a parameter. Move rcu read locking out to the callers to prepare for the change. -v2: Originally this patch was described as a fix for RCU read locking bug around @blkg, which Vivek pointed out to be incorrect. It was from misunderstanding the role of rcu locking as protecting @blkg not @blkcg. Patch description updated. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	72e06c2551	blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	6ecf23afab	block: extend queue bypassing to cover blkcg policies Extend queue bypassing such that dying queue is always bypassing and blk-throttle is drained on bypass. With blkcg policies updated to test blk_queue_bypass() instead of blk_queue_dead(), this ensures that no bio or request is held by or going through blkcg policies on a bypassing queue. This will be used to implement blkg cleanup on elevator switches and policy changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	d732580b4e	block: implement blk_queue_bypass_start/end() Rename and extend elv_queisce_start/end() to blk_queue_bypass_start/end() which are exported and supports nesting via @q->bypass_depth. Also add blk_queue_bypass() to test bypass state. This will be further extended and used for blkio_group management. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:21 +01:00
Tejun Heo	b2fab5acd2	elevator: make elevator_init_fn() return 0/-errno elevator_ops->elevator_init_fn() has a weird return value. It returns a void * which the caller should assign to q->elevator->elevator_data and %NULL return denotes init failure. Update such that it returns integer 0/-errno and sets elevator_data directly as necessary. This makes the interface more conventional and eases further cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:21 +01:00
Tejun Heo	5a5bafdc39	elevator: clear auxiliary data earlier during elevator switch Elevator switch tries hard to keep as much as context until new elevator is ready so that it can revert to the original state if initializing the new elevator fails for some reason. Unfortunately, with more auxiliary contexts to manage, this makes elevator init and exit paths too complex and fragile. This patch makes elevator_switch() unregister the current elevator and flush icq's before start initializing the new one. As we still keep the old elevator itself, the only difference is that we lose icq's on rare occassions of switching failure, which isn't critical at all. Note that this makes explicit elevator parameter to elevator_init_queue() and __elv_register_queue() unnecessary as they always can use the current elevator. This patch enables block cgroup cleanups. -v2: blk_add_trace_msg() prints elevator name from @new_e instead of @e->type as the local variable no longer exists. This caused build failure on CONFIG_BLK_DEV_IO_TRACE. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:21 +01:00
Tejun Heo	b95ada558c	cfq: don't register propio policy if !CONFIG_CFQ_GROUP_IOSCHED cfq has been registering zeroed blkio_poilcy_cfq if CFQ_GROUP_IOSCHED is disabled. This fortunately doesn't collide with blk-throtl as BLKIO_POLICY_PROP is zero but is unnecessary and risky. Just don't register it if not enabled. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:21 +01:00
Tejun Heo	32e380aedc	blkcg: make CONFIG_BLK_CGROUP bool Block cgroup core can be built as module; however, it isn't too useful as blk-throttle can only be built-in and cfq-iosched is usually the default built-in scheduler. Scheduled blkcg cleanup requires calling into blkcg from block core. To simplify that, disallow building blkcg as module by making CONFIG_BLK_CGROUP bool. If building blkcg core as module really matters, which I doubt, we can revisit it after blkcg API cleanup. -v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to depend on BLK_CGROUP. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:21 +01:00
Tejun Heo	b855b04a0b	block: blk-throttle should be drained regardless of q->elevator Currently, blk_cleanup_queue() doesn't call elv_drain_elevator() if q->elevator doesn't exist; however, bio based drivers don't have elevator initialized but can still use blk-throttle. This patch moves q->elevator test inside blk_drain_queue() such that only elv_drain_elevator() is skipped if !q->elevator. -v2: loop can have registered queue which has NULL request_fn. Make sure we don't call into __blk_run_queue() in such cases. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Vivek Goyal <vgoyal@redhat.com> Fold in bug fix from Vivek. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:24:55 +01:00
Alan Stern	62d3c5439c	Block: use a freezable workqueue for disk-event polling This patch (as1519) fixes a bug in the block layer's disk-events polling. The polling is done by a work routine queued on the system_nrt_wq workqueue. Since that workqueue isn't freezable, the polling continues even in the middle of a system sleep transition. Obviously, polling a suspended drive for media changes and such isn't a good thing to do; in the case of USB mass-storage devices it can lead to real problems requiring device resets and even re-enumeration. The patch fixes things by creating a new system-wide, non-reentrant, freezable workqueue and using it for disk-events polling. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> CC: <stable@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-02 10:51:00 +01:00
Stanislaw Gruszka	9f53d2fe81	block: fix __blkdev_get and add_disk race condition The following situation might occur: __blkdev_get: add_disk: register_disk() get_gendisk() disk_block_events() disk->ev == NULL disk_add_events() __disk_unblock_events() disk->ev != NULL --ev->block Then we unblock events, when they are suppose to be blocked. This can trigger events related block/genhd.c warnings, but also can crash in sd_check_events() or other places. I'm able to reproduce crashes with the following scripts (with connected usb dongle as sdb disk). <snip> DEV=/dev/sdb ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue function stop_me() { for i in `jobs -p` ; do kill $i 2> /dev/null ; done exit } trap stop_me SIGHUP SIGINT SIGTERM for ((i = 0; i < 10; i++)) ; do while true; do fdisk -l $DEV 2>&1 > /dev/null ; done & done while true ; do echo 1 > $ENABLE sleep 1 echo 0 > $ENABLE done </snip> I use the script to verify patch fixing oops in sd_revalidate_disk http://marc.info/?l=linux-scsi&m=132935572512352&w=2 Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in sd_revalidate_disk" or this one, script easily crash kernel within a few seconds. With both patches applied I do not observe crash. Unfortunately after some time (dozen of minutes), script will hung in: [ 1563.906432] [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20 [ 1563.906437] [<c04532d5>] msleep+0x15/0x20 [ 1563.906443] [<c05d60b2>] blk_drain_queue+0x32/0xd0 [ 1563.906447] [<c05d6e00>] blk_cleanup_queue+0xd0/0x170 [ 1563.906454] [<c06d278f>] scsi_free_queue+0x3f/0x60 [ 1563.906459] [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0 [ 1563.906463] [<c06d4aff>] scsi_forget_host+0x4f/0x60 [ 1563.906468] [<c06cd84a>] scsi_remove_host+0x5a/0xf0 [ 1563.906482] [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage] [ 1563.906490] [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage] Anyway I think this patch is some step forward. As drawback, I do not teardown on sysfs file create error, because I do not know how to nullify disk->ev (since it can be used). However add_disk error handling practically does not exist too, and things will work without this sysfs file, except events will not be exported to user space. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-02 10:44:17 +01:00
Jun'ichi Nomura	fe316bf2d5	block: Fix NULL pointer dereference in sd_revalidate_disk Since 2.6.39 (`1196f8b`), when a driver returns -ENOMEDIUM for open(), __blkdev_get() calls rescan_partitions() to remove in-kernel partition structures and raise KOBJ_CHANGE uevent. However it ends up calling driver's revalidate_disk without open and could cause oops. In the case of SCSI: process A process B ---------------------------------------------- sys_open __blkdev_get sd_open returns -ENOMEDIUM scsi_remove_device <scsi_device torn down> rescan_partitions sd_revalidate_disk <oops> Oopses are reported here: http://marc.info/?l=linux-scsi&m=132388619710052 This patch separates the partition invalidation from rescan_partitions() and use it for -ENOMEDIUM case. Reported-by: Huajun Li <huajun.li.lee@gmail.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-02 10:38:33 +01:00
Ingo Molnar	7e4d960993	Merge branch 'linus' into sched/core Merge reason: we'll queue up dependent patches. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-01 10:26:43 +01:00
Anton Altaparmakov	97387e3baa	LDM: Fix reassembly of extended VBLKs. From: Ben Hutchings <ben@decadent.org.uk> Extended VBLKs (those larger than the preset VBLK size) are divided into fragments, each with its own VBLK header. Our LDM implementation generally assumes that each VBLK is contiguous in memory, so these fragments must be assembled before further processing. Currently the reassembly seems to be done quite wrongly - no VBLK header is copied into the contiguous buffer, and the length of the header is subtracted twice from each fragment. Also the total length of the reassembled VBLK is calculated incorrectly. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Anton Altaparmakov <anton@tuxera.com>	2012-02-24 09:37:42 +00:00
Tejun Heo	621032ad6e	block: exit_io_context() should call elevator_exit_icq_fn() While updating locking, `b2efa05265` "block, cfq: unlink cfq_io_context's immediately" moved elevator_exit_icq_fn() invocation from exit_io_context() to the final ioc put. While this doesn't cause catastrophic failure, it effectively removes task exit notification to elevator and cause noticeable IO performance degradation with CFQ. On task exit, CFQ used to immediately expire the slice if it was being used by the exiting task as no more IO would be issued by the task; however, after `b2efa05265`, the notification is lost and disk could sit idle needlessly, leading to noticeable IO performance degradation for certain workloads. This patch renames ioc_exit_icq() to ioc_destroy_icq(), separates elevator_exit_icq_fn() invocation into ioc_exit_icq() and invokes it from exit_io_context(). ICQ_EXITED flag is added to avoid invoking the callback more than once for the same icq. Walking icq_list from ioc side and invoking elevator callback requires reverse double locking. This may be better implemented using RCU; unfortunately, using RCU isn't trivial. e.g. RCU protection would need to cover request_queue and queue_lock switch on cleanup makes grabbing queue_lock from RCU unsafe. Reverse double locking should do, at least for now. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-bisected-by: Shaohua Li <shli@kernel.org> LKML-Reference: <CANejiEVzs=pUhQSTvUppkDcc2TNZyfohBRLygW5zFmXyk5A-xQ@mail.gmail.com> Tested-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-15 09:45:53 +01:00
Tejun Heo	2274b029f6	block: simplify ioc_release_fn() Reverse double lock dancing in ioc_release_fn() can be simplified by just using trylock on the queue_lock and back out from ioc lock on trylock failure. Simplify it. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-15 09:45:52 +01:00
Tejun Heo	d705ae6b13	block: replace icq->changed with icq->flags icq->changed was used for ICQ_*_CHANGED bits. Rename it to flags and access it under ioc->lock instead of using atomic bitops. ioc_get_changed() is added so that the changed part can be fetched and cleared as before. icq->flags will be used to carry other flags. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-15 09:45:49 +01:00
Tejun Heo	d8c66c5d59	block: fix lockdep warning on io_context release put_io_context() `11a3122f6c` "block: strip out locking optimization in put_io_context()" removed ioc_lock depth lockdep annoation along with locking optimization; however, while recursing from put_io_context() is no longer possible, ioc_release_fn() may still end up putting the last reference of another ioc through elevator, which wlil grab ioc->lock triggering spurious (as the ioc is always different one) A-A deadlock warning. As this can only happen one time from ioc_release_fn(), using non-zero subclass from ioc_release_fn() is enough. Use subclass 1. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-11 12:37:25 +01:00
Stanislaw Gruszka	37b40adf2d	bsg: fix sysfs link remove warning We create "bsg" link if q->kobj.sd is not NULL, so remove it only when the same condition is true. Fixes: WARNING: at fs/sysfs/inode.c:323 sysfs_hash_and_remove+0x2b/0x77() sysfs: can not remove 'bsg', no directory Call Trace: [<c0429683>] warn_slowpath_common+0x6a/0x7f [<c0537a68>] ? sysfs_hash_and_remove+0x2b/0x77 [<c042970b>] warn_slowpath_fmt+0x2b/0x2f [<c0537a68>] sysfs_hash_and_remove+0x2b/0x77 [<c053969a>] sysfs_remove_link+0x20/0x23 [<c05d88f1>] bsg_unregister_queue+0x40/0x6d [<c0692263>] __scsi_remove_device+0x31/0x9d [<c069149f>] scsi_forget_host+0x41/0x52 [<c0689fa9>] scsi_remove_host+0x71/0xe0 [<f7de5945>] quiesce_and_remove_host+0x51/0x83 [usb_storage] [<f7de5a1e>] usb_stor_disconnect+0x18/0x22 [usb_storage] [<c06c29de>] usb_unbind_interface+0x4e/0x109 [<c067a80f>] __device_release_driver+0x6b/0xa6 [<c067a861>] device_release_driver+0x17/0x22 [<c067a46a>] bus_remove_device+0xd6/0xe6 [<c06785e2>] device_del+0xf2/0x137 [<c06c101f>] usb_disable_device+0x94/0x1a0 Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-08 20:02:03 +01:00
Tejun Heo	07c2bd3735	block: don't call elevator callbacks for plug merges Plug merge calls two elevator callbacks outside queue lock - elevator_allow_merge_fn() and elevator_bio_merged_fn(). Although attempt_plug_merge() suggests that elevator is guaranteed to be there through the existing request on the plug list, nothing prevents plug merge from calling into dying or initializing elevator. For regular merges, bypass ensures elvpriv count to reach zero, which in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER from forced back insertion. Plug merge doesn't check ELVPRIV, and, as the requests haven't gone through elevator insertion yet, it doesn't have SOFTBARRIER set allowing merges on a bypassed queue. This, for example, leads to the following crash during elevator switch. BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0 PGD 112cbc067 PUD 115d5c067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU 1 Modules linked in: deadline_iosched Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs RIP: 0010:[<ffffffff813b34e9>] [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0 RSP: 0018:ffff8801143a38f8 EFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0 RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8 RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708 R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708 FS: 00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0) Stack: ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1 ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e Call Trace: [<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60 [<ffffffff81391bf1>] elv_try_merge+0x21/0x40 [<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390 [<ffffffff81396a5a>] generic_make_request+0xca/0x100 [<ffffffff81396b04>] submit_bio+0x74/0x100 [<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450 [<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60 [<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760 [<ffffffff811986b2>] do_sync_read+0xe2/0x120 [<ffffffff81199345>] vfs_read+0xc5/0x180 [<ffffffff81199501>] sys_read+0x51/0x90 [<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b There are multiple ways to fix this including making plug merge check ELVPRIV; however, * Calling into elevator outside queue lock is confusing and error-prone. * Requests on plug list aren't known to the elevator. They aren't on the elevator yet, so there's no elevator specific state to update. * Given the nature of plug merges - collecting bio's for the same purpose from the same issuer - elevator specific restrictions aren't applicable. So, simply don't call into elevator methods from plug merge by moving elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and using blk_try_merge() in attempt_plug_merge(). This is based on Jens' patch to skip elevator_allow_merge_fn() from plug merge. Note that this makes per-cgroup merged stats skip plug merging. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <4F16F3CA.90904@kernel.dk> Original-patch-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-08 09:19:42 +01:00
Tejun Heo	050c8ea80e	block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions blk_rq_merge_ok() is the elevator-neutral part of merge eligibility test. blk_try_merge() determines merge direction and expects the caller to have tested elv_rq_merge_ok() previously. elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls elv_iosched_allow_merge(). elv_try_merge() is removed and the two callers are updated to call elv_rq_merge_ok() explicitly followed by blk_try_merge(). While at it, make rq_merge_ok() functions return bool. This is to prepare for plug merge update and doesn't introduce any behavior change. This is based on Jens' patch to skip elevator_allow_merge_fn() from plug merge. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <4F16F3CA.90904@kernel.dk> Original-patch-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-08 09:19:38 +01:00
Tejun Heo	11a3122f6c	block: strip out locking optimization in put_io_context() put_io_context() performed a complex trylock dancing to avoid deferring ioc release to workqueue. It was also broken on UP because trylock was always assumed to succeed which resulted in unbalanced preemption count. While there are ways to fix the UP breakage, even the most pathological microbench (forced ioc allocation and tight fork/exit loop) fails to show any appreciable performance benefit of the optimization. Strip it out. If there turns out to be workloads which are affected by this change, simpler optimization from the discussion thread can be applied later. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <1328514611.21268.66.camel@sli10-conroe> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-07 07:51:30 +01:00
Shaohua Li	9fa73472dd	block: fix ioc locking warning Meelis reported a warning: WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec() Hardware name: 939Dual-SATA2 timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103 Modules linked in: sr_mod cdrom videodev media drm_kms_helper ohci_hcd ehci_hcd v4l2_compat_ioctl32 usbcore i2c_ali15x3 snd_seq drm snd_timer snd_seq Pid: 0, comm: swapper Not tainted 3.3.0-rc2-00110-gd125666 #176 Call Trace: <IRQ> [<ffffffff81022aaa>] warn_slowpath_common+0x7e/0x96 [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d [<ffffffff81022b56>] warn_slowpath_fmt+0x41/0x43 [<ffffffff8114c526>] ? cfq_idle_slice_timer+0xa1/0xaa [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d [<ffffffff8102c124>] run_timer_softirq+0x199/0x1ec [<ffffffff81047a53>] ? timekeeping_get_ns+0x12/0x31 [<ffffffff810145fd>] ? apic_write+0x11/0x13 [<ffffffff81027475>] __do_softirq+0x74/0xfa [<ffffffff812f337a>] call_softirq+0x1a/0x30 [<ffffffff81002ff9>] do_softirq+0x31/0x68 [<ffffffff810276cf>] irq_exit+0x3d/0xa3 [<ffffffff81014aca>] smp_apic_timer_interrupt+0x6b/0x77 [<ffffffff812f2de9>] apic_timer_interrupt+0x69/0x70 <EOI> [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d [<ffffffff8100801f>] ? default_idle+0x1e/0x32 [<ffffffff81008019>] ? default_idle+0x18/0x32 [<ffffffff810008b1>] cpu_idle+0x87/0xd1 [<ffffffff812de861>] rest_init+0x85/0x89 [<ffffffff81659a4d>] start_kernel+0x2eb/0x2f8 [<ffffffff8165926e>] x86_64_start_reservations+0x7e/0x82 [<ffffffff81659362>] x86_64_start_kernel+0xf0/0xf7 this_q == locked_q is possible. There are two problems here: 1. In UP case, there is preemption counter issue as spin_trylock always successes. 2. In SMP case, the loop breaks too earlier. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Knut Petersen <Knut_Petersen@t-online.de> Tested-by: Knut Petersen <Knut_Petersen@t-online.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-06 08:57:29 +01:00
Li Zefan	761b3ef50e	cgroup: remove cgroup_subsys argument from callbacks The argument is not used at all, and it's not necessary, because a specific callback handler of course knows which subsys it belongs to. Now only ->pupulate() takes this argument, because the handlers of this callback always call cgroup_add_file()/cgroup_add_files(). So we reduce a few lines of code, though the shrinking of object size is minimal. 16 files changed, 113 insertions(+), 162 deletions(-) text data bss dec hex filename 5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig 5486170 656987 7039960 13183117 c9288d vmlinux.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-02-02 09:20:22 -08:00
Peter Zijlstra	39be350127	sched, block: Unify cache detection The block layer has some code trying to determine if two CPUs share a cache, the scheduler has a similar function. Expose the function used by the scheduler and make the block layer use it, thereby removing the block layers usage of CONFIG_SCHED* and topology bits. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Jens Axboe <axboe@kernel.dk> Link: http://lkml.kernel.org/r/1327579450.2446.95.camel@twins	2012-01-27 13:28:48 +01:00
Shaohua Li	05c30b9551	block: fix NULL icq_cache reference Vivek reported a kernel crash: [ 94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c [ 94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200 [ 94.218004] PGD 13abda067 PUD 137d52067 PMD 0 [ 94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [ 94.218004] CPU 0 [ 94.218004] Modules linked in: [last unloaded: scsi_wait_scan] [ 94.218004] [ 94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch [ 94.218004] RIP: 0010:[<ffffffff81142fae>] [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200 [ 94.218004] RSP: 0018:ffff88013fc03de0 EFLAGS: 00010006 [ 94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b [ 94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae [ 94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001 [ 94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246 [ 94.218004] FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000 [ 94.218004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0 [ 94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020) [ 94.218004] Stack: [ 94.218004] 0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00 [ 94.218004] 0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6 [ 94.218004] ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240 [ 94.218004] Call Trace: [ 94.218004] <IRQ> [ 94.218004] [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20 [ 94.218004] [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420 [ 94.218004] [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250 [ 94.218004] [<ffffffff810405ee>] __do_softirq+0xce/0x3e0 [ 94.218004] [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100 [ 94.218004] [<ffffffff81090104>] ? tick_program_event+0x24/0x30 [ 94.218004] [<ffffffff8183ed1c>] call_softirq+0x1c/0x30 [ 94.218004] [<ffffffff8100422d>] do_softirq+0x8d/0xc0 [ 94.218004] [<ffffffff81040c3e>] irq_exit+0xae/0xe0 [ 94.218004] [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99 [ 94.218004] [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80 Once a queue is quiesced, it's not supposed to have any elvpriv data or icq's, and elevator switching depends on that. Request alloc path followed the rule for elvpriv data but forgot apply it to icq's leading to the following crash during elevator switch. Fix it by not allocating icq's if ELVPRIV is not set for the request. Reported-by: Vivek Goyal <vgoyal@redhat.com> Tested-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-01-19 09:20:10 +01:00
Shaohua Li	df0793abb9	block,cfq: change code order cfq_slice_expired will change saved_workload_slice. It should be called first so saved_workload_slice is correctly set to 0 after workload type is changed. This fixes the code order changed by `54b466e44b`. Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-01-19 09:20:09 +01:00
Jens Axboe	54b466e44b	cfq-iosched: fix use-after-free of cfqq With the changes in life time management between the cfq IO contexts and the cfq queues, we now risk having cfqd->active_queue being freed when cfq_slice_expired() is being called. cfq_preempt_queue() caches this queue and uses it after calling said function, causing a use-after-free condition. This triggers the following oops, when cfqq_type() attempts to dereference it: BUG: unable to handle kernel paging request at ffff8800746c4f0c IP: [<ffffffff81266d59>] cfqq_type+0xb/0x20 PGD 18d4063 PUD 1fe15067 PMD 1ffb9067 PTE 80000000746c4160 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC CPU 3 Modules linked in: Pid: 1, comm: init Not tainted 3.2.0-josef+ #367 Bochs Bochs RIP: 0010:[<ffffffff81266d59>] [<ffffffff81266d59>] cfqq_type+0xb/0x20 RSP: 0018:ffff880079c11778 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff880076f3df08 RCX: 0000000000000000 RDX: 0000000000000006 RSI: ffff880074271888 RDI: ffff8800746c4f08 RBP: ffff880079c11778 R08: 0000000000000078 R09: 0000000000000001 R10: 09f911029d74e35b R11: 09f911029d74e35b R12: ffff880076f337f0 R13: ffff8800746c4f08 R14: ffff8800746c4f08 R15: 0000000000000002 FS: 00007f62fd44f700(0000) GS:ffff88007cd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8800746c4f0c CR3: 0000000076c21000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process init (pid: 1, threadinfo ffff880079c10000, task ffff880079c0a040) Stack: ffff880079c117c8 ffffffff812683d8 ffff880079c117a8 ffffffff8125de43 ffff8800744fcf48 ffff880074b43e98 ffff8800770c8828 ffff880074b43e98 0000000000000003 0000000000000000 ffff880079c117f8 ffffffff81254149 Call Trace: [<ffffffff812683d8>] cfq_insert_request+0x3f5/0x47c [<ffffffff8125de43>] ? blk_recount_segments+0x20/0x31 [<ffffffff81254149>] __elv_add_request+0x1ca/0x200 [<ffffffff8125aa99>] blk_queue_bio+0x2ef/0x312 [<ffffffff81258f7b>] generic_make_request+0x9f/0xe0 [<ffffffff8125907b>] submit_bio+0xbf/0xca [<ffffffff81136ec7>] submit_bh+0xdf/0xfe [<ffffffff81176d04>] ext3_bread+0x50/0x99 [<ffffffff811785b3>] dx_probe+0x38/0x291 [<ffffffff81178864>] ext3_dx_find_entry+0x58/0x219 [<ffffffff81178ad5>] ext3_find_entry+0xb0/0x406 [<ffffffff8110c4d5>] ? cache_alloc_debugcheck_after.isra.46+0x14d/0x1a0 [<ffffffff8110cfbd>] ? kmem_cache_alloc+0xef/0x191 [<ffffffff8117a330>] ext3_lookup+0x39/0xe1 [<ffffffff81119461>] d_alloc_and_lookup+0x45/0x6c [<ffffffff8111ac41>] do_lookup+0x1e4/0x2f5 [<ffffffff8111aef6>] link_path_walk+0x1a4/0x6ef [<ffffffff8111b557>] path_lookupat+0x59/0x5ea [<ffffffff8127406c>] ? __strncpy_from_user+0x30/0x5a [<ffffffff8111bce0>] do_path_lookup+0x23/0x59 [<ffffffff8111cfd6>] user_path_at_empty+0x53/0x99 [<ffffffff8107b37b>] ? remove_wait_queue+0x51/0x56 [<ffffffff8111d02d>] user_path_at+0x11/0x13 [<ffffffff811141f5>] vfs_fstatat+0x3a/0x64 [<ffffffff8111425a>] vfs_stat+0x1b/0x1d [<ffffffff81114359>] sys_newstat+0x1a/0x33 [<ffffffff81060e12>] ? task_stopped_code+0x42/0x42 [<ffffffff815d6712>] system_call_fastpath+0x16/0x1b Code: 89 e6 48 89 c7 e8 fa ca fe ff 85 c0 74 06 4c 89 2b 41 b6 01 5b 44 89 f0 41 5c 41 5d 41 5e 5d c3 55 48 89 e5 66 66 66 66 90 31 c0 <8b> 57 04 f6 c6 01 74 0b 83 e2 20 83 fa 01 19 c0 83 c0 02 5d c3 RIP [<ffffffff81266d59>] cfqq_type+0xb/0x20 RSP <ffff880079c11778> CR2: ffff8800746c4f0c Get rid of the caching of cfqd->active_queue, and reorder the check so that it happens before we expire the active queue. Thanks to Tejun for pin pointing the error location. Reported-by: Chris Mason <chris.mason@oracle.com> Tested-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-01-17 21:26:11 +01:00
Linus Torvalds	b3c9dd182e	Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-block * 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits) Revert "block: recursive merge requests" block: Stop using macro stubs for the bio data integrity calls blockdev: convert some macros to static inlines fs: remove unneeded plug in mpage_readpages() block: Add BLKROTATIONAL ioctl block: Introduce blk_set_stacking_limits function block: remove WARN_ON_ONCE() in exit_io_context() block: an exiting task should be allowed to create io_context block: ioc_cgroup_changed() needs to be exported block: recursive merge requests block, cfq: fix empty queue crash caused by request merge block, cfq: move icq creation and rq->elv.icq association to block core block, cfq: restructure io_cq creation path for io_context interface cleanup block, cfq: move io_cq exit/release to blk-ioc.c block, cfq: move icq cache management to block core block, cfq: move io_cq lookup to blk-ioc.c block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq block, cfq: reorganize cfq_io_context into generic and cfq specific parts block: remove elevator_queue->ops block: reorder elevator switch sequence ... Fix up conflicts in: - block/blk-cgroup.c Switch from can_attach_task to can_attach - block/cfq-iosched.c conflict with now removed cic index changes (we now use q->id instead)	2012-01-15 12:24:45 -08:00
Jens Axboe	5d381efb3d	Revert "block: recursive merge requests" This reverts commit `274193224c`. We have some problems related to selection of empty queues that need to be resolved, evidence so far points to the recursive merge logic making either being the cause or at least the accelerator for this. So revert it for now, until we figure this out. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-01-15 10:29:48 +01:00
Paolo Bonzini	0bfc96cb77	block: fail SCSI passthrough ioctls on partition devices Linux allows executing the SG_IO ioctl on a partition or LVM volume, and will pass the command to the underlying block device. This is well-known, but it is also a large security problem when (via Unix permissions, ACLs, SELinux or a combination thereof) a program or user needs to be granted access only to part of the disk. This patch lets partitions forward a small set of harmless ioctls; others are logged with printk so that we can see which ioctls are actually sent. In my tests only CDROM_GET_CAPABILITY actually occurred. Of course it was being sent to a (partition on a) hard disk, so it would have failed with ENOTTY and the patch isn't changing anything in practice. Still, I'm treating it specially to avoid spamming the logs. In principle, this restriction should include programs running with CAP_SYS_RAWIO. If for example I let a program access /dev/sda2 and /dev/sdb, it still should not be able to read/write outside the boundaries of /dev/sda2 independent of the capabilities. However, for now programs with CAP_SYS_RAWIO will still be allowed to send the ioctls. Their actions will still be logged. This patch does not affect the non-libata IDE driver. That driver however already tests for bd != bd->bd_contains before issuing some ioctl; it could be restricted further to forbid these ioctls even for programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO. Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: James Bottomley <JBottomley@parallels.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [ Make it also print the command name when warning - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-14 15:07:24 -08:00
Paolo Bonzini	577ebb374c	block: add and use scsi_blk_cmd_ioctl Introduce a wrapper around scsi_cmd_ioctl that takes a block device. The function will then be enhanced to detect partition block devices and, in that case, subject the ioctls to whitelisting. Cc: linux-scsi@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: James Bottomley <JBottomley@parallels.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-14 15:07:24 -08:00
Martin K. Petersen	ef00f59c95	block: Add BLKROTATIONAL ioctl Introduce an ioctl which permits applications to query whether a block device is rotational. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-01-11 16:29:31 +01:00
Martin K. Petersen	b1bd055d39	block: Introduce blk_set_stacking_limits function Stacking driver queue limits are typically bounded exclusively by the capabilities of the low level devices, not by the stacking driver itself. This patch introduces blk_set_stacking_limits() which has more liberal metrics than the default queue limits function. This allows us to inherit topology parameters from bottom devices without manually tweaking the default limits in each driver prior to calling the stacking function. Since there is now a clear distinction between stacking and low-level devices, blk_set_default_limits() has been modified to carry the more conservative values that we used to manually set in blk_queue_make_request(). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-01-11 16:27:11 +01:00
Linus Torvalds	db0c2bf69a	Merge branch 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) cgroup: fix to allow mounting a hierarchy by name cgroup: move assignement out of condition in cgroup_attach_proc() cgroup: Remove task_lock() from cgroup_post_fork() cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end() cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static cgroup: only need to check oldcgrp==newgrp once cgroup: remove redundant get/put of task struct cgroup: remove redundant get/put of old css_set from migrate cgroup: Remove unnecessary task_lock before fetching css_set on migration cgroup: Drop task_lock(parent) on cgroup_fork() cgroups: remove redundant get/put of css_set from css_set_check_fetched() resource cgroups: remove bogus cast cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task() cgroup, cpuset: don't use ss->pre_attach() cgroup: don't use subsys->can_attach_task() or ->attach_task() cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach() cgroup: improve old cgroup handling in cgroup_attach_proc() cgroup: always lock threadgroup during migration threadgroup: extend threadgroup_lock() to cover exit and exec threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem ... Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups: fix a css_set not found bug in cgroup_attach_proc" that already mentioned that the bug is fixed (differently) in Tejun's cgroup patchset. This one, in other words.	2012-01-09 12:59:24 -08:00
Linus Torvalds	972b2c7199	Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits) reiserfs: Properly display mount options in /proc/mounts vfs: prevent remount read-only if pending removes vfs: count unlinked inodes vfs: protect remounting superblock read-only vfs: keep list of mounts for each superblock vfs: switch ->show_options() to struct dentry * vfs: switch ->show_path() to struct dentry * vfs: switch ->show_devname() to struct dentry * vfs: switch ->show_stats to struct dentry * switch security_path_chmod() to struct path * vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb vfs: trim includes a bit switch mnt_namespace ->root to struct mount vfs: take /proc//mounts and friends to fs/proc_namespace.c vfs: opencode mntget() mnt_set_mountpoint() vfs: spread struct mount - remaining argument of next_mnt() vfs: move fsnotify junk to struct mount vfs: move mnt_devname vfs: move mnt_list to struct mount vfs: switch pnode.h macros to struct mount ...	2012-01-08 12:19:57 -08:00
Al Viro	ece2ccb668	Merge branches 'vfsmount-guts', 'umode_t' and 'partitions' into Z	2012-01-06 23:15:54 -05:00
Linus Torvalds	07d106d0a3	vfs: fix up ENOIOCTLCMD error handling We're doing some odd things there, which already messes up various users (see the net/socket.c code that this removes), and it was going to add yet more crud to the block layer because of the incorrect error code translation. ENOIOCTLCMD is not an error return that should be returned to user mode from the "ioctl()" system call, but it should not be translated as EINVAL ("Invalid argument"). It should be translated as ENOTTY ("Inappropriate ioctl for device"). That EINVAL confusion has apparently so permeated some code that the block layer actually checks for it, which is sad. We continue to do so for now, but add a big comment about how wrong that is, and we should remove it entirely eventually. In the meantime, this tries to keep the changes localized to just the EINVAL -> ENOTTY fix, and removing code that makes it harder to do the right thing. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-05 15:40:12 -08:00
Al Viro	2c9ede55ec	switch device_get_devnode() and ->devnode() to umode_t * both callers of device_get_devnode() are only interested in lower 16bits and nobody tries to return anything wider than 16bit anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:55 -05:00
Al Viro	ff01bb4832	fs: move code out of buffer.c Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export kill_bdev as well, so brd doesn't have to open code it. Reduce buffer_head.h requirement accordingly. Removed a rather large comment from invalidate_bdev, as it looked a bit obsolete to bother moving. The small comment replacing it says enough. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:07 -05:00
Al Viro	94ea4158f1	separate partition format handling from generic code Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:06 -05:00
Al Viro	9be96f3fd1	move fs/partitions to block/ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:06 -05:00
Al Viro	4752bc309b	make register_disk() static Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:05 -05:00
Dan Williams	f2b20d4365	block: fix blk_queue_end_tag() Commit `5e081591` "block: warn if tag is greater than real_max_depth" cleaned up blk_queue_end_tag() to warn when the tag is truly invalid (greater than real_max_depth). However, it changed behavior in the tag < max_depth case to not end the request. Leading to triggering of BUG_ON(blk_queued_rq(rq)) in the request completion path: http://marc.info/?l=linux-kernel&m=132204370518629&w=2 In order to allow blk_queue_resize_tags() to shrink the tag space blk_queue_end_tag() must always complete tags with a value less than real_max_depth regardless of the current max_depth. The comment about "handling the shrink case" seems to be what prompted changes in this space, so remove it and BUG on all invalid tags (made even simpler by Matthew's suggestion to use an unsigned compare). Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Tao Ma <boyu.mt@taobao.com> Cc: Matthew Wilcox <matthew@wil.cx> Reported-by: Meelis Roos <mroos@ut.ee> Reported-by: Ed Nadolski <edmund.nadolski@intel.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-29 09:16:28 +01:00
Tejun Heo	c98b2cc29a	block: remove WARN_ON_ONCE() in exit_io_context() `6e736be7` "block: make ioc get/put interface more conventional and fix race on alloction" added WARN_ON_ONCE() in exit_io_context() which triggers if !PF_EXITING. All tasks hitting exit_io_context() from task exit should have PF_EXITING set but task struct tearing down after fork failure calls into the function without PF_EXITING, triggering the condition. WARNING: at block/blk-ioc.c:234 exit_io_context+0x40/0x92() Pid: 17090, comm: trinity Not tainted 3.2.0-rc6-next-20111222-sasha-dirty #77 Call Trace: [<ffffffff810b69a3>] warn_slowpath_common+0x8f/0xb2 [<ffffffff810b6a77>] warn_slowpath_null+0x18/0x1a [<ffffffff8181a7a2>] exit_io_context+0x40/0x92 [<ffffffff810b58c9>] copy_process+0x126f/0x1453 [<ffffffff810b5c1b>] do_fork+0x120/0x3e9 [<ffffffff8106242f>] sys_clone+0x26/0x28 [<ffffffff82425803>] stub_clone+0x13/0x20 ---[ end trace a2e4eb670b375238 ]--- Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-27 18:52:16 +01:00
Tejun Heo	fd63836811	block: an exiting task should be allowed to create io_context While fixing io_context creation / task exit race condition, `6e736be7f2` "block: make ioc get/put interface more conventional and fix race on alloction" also prevented an exiting (%PF_EXITING) task from creating its own io_context. This is incorrect as exit path may issue IOs, e.g. from exit_files(), and if those IOs are the first ones issued by the task, io_context needs to be created to process the IOs. Combined with the existing problem of io_context / io_cq creation failure having the possibility of stalling IO, this problem results in deterministic full IO lockup with certain workloads. Fix it by allowing io_context creation regardless of %PF_EXITING for %current. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andrew Morton <akpm@linux-foundation.org> Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-25 14:29:14 +01:00
majianpeng	609f6ea1c9	block: re-use existing 'reading' variable instead of checking direction again Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-21 15:27:24 +01:00
Jens Axboe	64c42998f1	block: ioc_cgroup_changed() needs to be exported With the ioc changed, ioc_cgroup_changed() can be used by modular code. So ensure that it is exported. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-19 10:36:44 +01:00
Shaohua Li	6ae0516b8a	block, cfq: fix empty queue crash caused by request merge All requests of a queue could be merged to other requests of other queue. Such queue will not have request in it, but it's in service tree. This will cause kernel oops. I encounter a BUG_ON() in cfq_dispatch_request() with next patch, but the issue should exist without the patch. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-16 14:04:23 +01:00
Shaohua Li	274193224c	block: recursive merge requests In my workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1, a+3,.... When the requests are flushed to queue, a and a+1 are merged to (a, a+1), a+2 and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3) aren't merged. With recursive merge below, the workload throughput gets improved 20% and context switch drops 60%. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-16 14:00:31 +01:00
Shaohua Li	4a0b75c7d0	block, cfq: fix empty queue crash caused by request merge All requests of a queue could be merged to other requests of other queue. Such queue will not have request in it, but it's in service tree. This will cause kernel oops. I encounter a BUG_ON() in cfq_dispatch_request() with next patch, but the issue should exist without the patch. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-16 14:00:22 +01:00
Tejun Heo	4eabc94125	block: don't kick empty queue in blk_drain_queue() While probing, fd sets up queue, probes hardware and tears down the queue if probing fails. In the process, blk_drain_queue() kicks the queue which failed to finish initialization and fd is unhappy about that. floppy0: no floppy controllers found ------------[ cut here ]------------ WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0() Hardware name: To Be Filled By O.E.M. VFS: do_fd_request called on non-open device Modules linked in: Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2 #2 Call Trace: [<ffffffff81039a6a>] warn_slowpath_common+0x7a/0xb0 [<ffffffff81039b41>] warn_slowpath_fmt+0x41/0x50 [<ffffffff813d657f>] do_fd_request+0xbf/0xd0 [<ffffffff81322b95>] blk_drain_queue+0x65/0x80 [<ffffffff81322c93>] blk_cleanup_queue+0xe3/0x1a0 [<ffffffff818a809d>] floppy_init+0xdeb/0xe28 [<ffffffff818a72b2>] ? daring+0x6b/0x6b [<ffffffff810002af>] do_one_initcall+0x3f/0x170 [<ffffffff81884b34>] kernel_init+0x9d/0x11e [<ffffffff810317c2>] ? schedule_tail+0x22/0xa0 [<ffffffff815dbb14>] kernel_thread_helper+0x4/0x10 [<ffffffff81884a97>] ? start_kernel+0x2be/0x2be [<ffffffff815dbb10>] ? gs_change+0xb/0xb Avoid it by making blk_drain_queue() kick queue iff dispatch queue has something on it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ralf Hildebrandt <Ralf.Hildebrandt@charite.de> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Tested-by: Sergei Trofimovich <slyich@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-15 20:03:04 +01:00
Tejun Heo	f1f8cc9465	block, cfq: move icq creation and rq->elv.icq association to block core Now block layer knows everything necessary to create and associate icq's with requests. Move ioc_create_icq() to blk-ioc.c and update get_request() such that, if elevator_type->icq_size is set, requests are automatically associated with their matching icq's before elv_set_request(). io_context reference is also managed by block core on request alloc/free. * Only ioprio/cgroup changed handling remains from cfq_get_cic(). Collapsed into cfq_set_request(). * This removes queue kicking on icq allocation failure (for now). As icq allocation failure is rare and the only effect of queue kicking achieved was possibily accelerating queue processing, this change shouldn't be noticeable. There is a larger underlying problem. Unlike request allocation, icq allocation is not guaranteed to succeed eventually after retries. The number of icq is unbound and thus mempool can't be the solution either. This effectively adds allocation dependency on memory free path and thus possibility of deadlock. This usually wouldn't happen because icq allocation is not a hot path and, even when the condition triggers, it's highly unlikely that none of the writeback workers already has icq. However, this is still possible especially if elevator is being switched under high memory pressure, so we better get it fixed. Probably the only solution is just bypassing elevator and appending to dispatch queue on any elevator allocation failure. * Comment added to explain how icq's are managed and synchronized. This completes cleanup of io_context interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	9b84cacd01	block, cfq: restructure io_cq creation path for io_context interface cleanup Add elevator_ops->elevator_init_icq_fn() and restructure cfq_create_cic() and rename it to ioc_create_icq(). The new function expects its caller to pass in io_context, uses elevator_type->icq_cache, handles generic init, calls the new elevator operation for elevator specific initialization, and returns pointer to created or looked up icq. This leaves cfq_icq_pool variable without any user. Removed. This prepares for io_context interface cleanup and doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	7e5a879449	block, cfq: move io_cq exit/release to blk-ioc.c With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc and q, and freeing automatically handled by blk-ioc. The elevator operation only need to perform exit operation specific to the elevator - in cfq's case, exiting the cfqq's. Also, clearing of io_cq's on q detach is moved to block core and automatically performed on elevator switch and q release. Because the q io_cq points to might be freed before RCU callback for the io_cq runs, blk-ioc code should remember to which cache the io_cq needs to be freed when the io_cq is released. New field io_cq->__rcu_icq_cache is added for this purpose. As both the new field and rcu_head are used only after io_cq is released and the q/ioc_node fields aren't, they are put into unions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	3d3c2379fe	block, cfq: move icq cache management to block core Let elevators set ->icq_size and ->icq_align in elevator_type and elv_register() and elv_unregister() respectively create and destroy kmem_cache for icq. * elv_register() now can return failure. All callers updated. * icq caches are automatically named "ELVNAME_io_cq". * cfq_slab_setup/kill() are collapsed into cfq_init/exit(). * While at it, minor indentation change for iosched_cfq.elevator_name for consistency. This will help moving icq management to block core. This doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	47fdd4ca96	block, cfq: move io_cq lookup to blk-ioc.c Now that all io_cq related data structures are in block core layer, io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c. Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with parameter return type changes (cfqd -> request_queue, cfq_io_cq -> io_cq) and cfq_cic_lookup() becomes thin wrapper around cfq_cic_lookup(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	a612fddf0d	block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq Most of icq management is about to be moved out of cfq into blk-ioc. This patch prepares for it. * Move cfqd->icq_list to request_queue->icq_list * Make request explicitly point to icq instead of through elevator private data. ->elevator_private[3] is replaced with sub struct elv which contains icq pointer and priv[2]. cfq is updated accordingly. * Meaningless clearing of ->elevator_private[0] removed from elv_set_request(). At that point in code, the field was guaranteed to be %NULL anyway. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:41 +01:00
Tejun Heo	c586980732	block, cfq: reorganize cfq_io_context into generic and cfq specific parts Currently io_context and cfq logics are mixed without clear boundary. Most of io_context is independent from cfq but cfq_io_context handling logic is dispersed between generic ioc code and cfq. cfq_io_context represents association between an io_context and a request_queue, which is a concept useful outside of cfq, but it also contains fields which are useful only to cfq. This patch takes out generic part and put it into io_cq (io context-queue) and the rest into cfq_io_cq (cic moniker remains the same) which contains io_cq. The following changes are made together. * cfq_ttime and cfq_io_cq now live in cfq-iosched.c. * All related fields, functions and constants are renamed accordingly. * ioc->ioc_data is now "struct io_cq " instead of "void " and renamed to icq_hint. This prepares for io_context API cleanup. Documentation is currently sparse. It will be added later. Changes in this patch are mechanical and don't cause functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:41 +01:00
Tejun Heo	22f746e235	block: remove elevator_queue->ops elevator_queue->ops points to the same ops struct ->elevator_type.ops is pointing to. The only effect of caching it in elevator_queue is shorter notation - it doesn't save any indirect derefence. Relocate elevator_type->list which used only during module init/exit to the end of the structure, rename elevator_queue->elevator_type to ->type, and replace elevator_queue->ops with elevator_queue->type.ops. This doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:41 +01:00
Tejun Heo	f8fc877d3c	block: reorder elevator switch sequence Elevator switch sequence first attached the new elevator, then tried registering it (sysfs) and if that failed attached back the old elevator. However, sysfs registration doesn't require the elevator to be attached, so there is no reason to do the "detach, attach new, register, maybe re-attach old" sequence. It can just do "register, detach, attach". * elevator_init_queue() is updated to set ->elevator_data directly and return 0 / -errno. This allows elevator_exit() on an unattached elevator. * __elv_unregister_queue() which was necessary to unregister unattached q is removed in favor of __elv_register_queue() which can register unattached q. * elevator_attach() becomes a single assignment and obscures more then it helps. Dropped. This will help cleaning up io_context handling across elevator switch. This patch doesn't introduce visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:40 +01:00

1 2 3 4 5 ...

1829 Commits