
Fernando Luis Vázquez Cao 08-06-2008 01:13 AM

RFC: I/O bandwidth controller (was Too many I/O controller patches)
 
On Mon, 2008-08-04 at 10:20 -0700, Dave Hansen wrote:
> On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
> > This series of patches of dm-ioband now includes "The bio tracking mechanism,"
> > which has been posted individually to this mailing list.
> > This makes it easy for anybody to control the I/O bandwidth even when
> > the I/O is one of delayed-write requests.
>
> During the Containers mini-summit at OLS, it was mentioned that there
> are at least *FOUR* of these I/O controllers floating around. Have you
> talked to the other authors? (I've cc'd at least one of them).
>
> We obviously can't come to any kind of real consensus with people just
> tossing the same patches back and forth.
>
> -- Dave

Hi Dave,

I have been tracking the memory controller patches for a while, which
spurred my interest in cgroups and prompted me to start working on I/O
bandwidth controlling mechanisms. This year I have had several
opportunities to discuss the design challenges of I/O controllers with
the NEC and VALinux Japan teams (CCed), most recently last month during
the Linux Foundation Japan Linux Symposium, where we took advantage of
Andrew Morton's visit to Japan to do some brainstorming on this topic. I
will try to summarize what was discussed there (and at the Linux Storage
& Filesystem Workshop earlier this year) and propose a hopefully
acceptable way to proceed and get things started.

This RFC ended up being a bit longer than I had originally intended, but
hopefully it will serve as the start of a fruitful discussion.

As you pointed out, it seems that there is not much consensus building
going on, but that does not mean there is a lack of interest. To get the
ball rolling it is probably a good idea to clarify the state of things
and try to establish what we are trying to accomplish.

*** State of things in the mainstream kernel
The kernel has had somewhat advanced I/O control capabilities for quite
some time now: CFQ. But the current CFQ has some problems:
- I/O priority can be set by PID, PGRP, or UID, but...
- ...all the processes that fall within the same class/priority are
scheduled together and arbitrary groupings are not possible.
- Buffered I/O is not handled properly.
- CFQ's IO priority is an attribute of a process that affects all
devices it sends I/O requests to. In other words, with the current
implementation it is not possible to assign per-device IO priorities to
a task (a small example follows below).
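
For illustration, this is roughly what setting a CFQ best-effort priority
looks like from userspace; a minimal sketch using the raw ioprio_set()
syscall (there is no glibc wrapper, and the constants below just mirror
include/linux/ioprio.h). Note that the priority applies to the task as a
whole, for every device it touches:

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* These mirror include/linux/ioprio.h; they are not exported by glibc. */
#define IOPRIO_CLASS_SHIFT              13
#define IOPRIO_CLASS_BE                 2
#define IOPRIO_WHO_PROCESS              1
#define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | (data))

int main(int argc, char **argv)
{
        pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();
        /* Best-effort class, priority level 2 (0 = highest, 7 = lowest). */
        int ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 2);

        /* The priority is attached to the process, not to any particular
         * block device, which is exactly the limitation described above. */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid, ioprio) < 0) {
                perror("ioprio_set");
                return 1;
        }
        printf("set best-effort ioprio %d for pid %d\n", ioprio, (int)pid);
        return 0;
}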

*** Goals
1. Cgroups-aware I/O scheduling (being able to define arbitrary
groupings of processes and treat each group as a single scheduling
entity).
2. Being able to perform I/O bandwidth control independently on each
device.
3. I/O bandwidth shaping.
4. Scheduler-independent I/O bandwidth control.
5. Usable with stacking devices (md, dm and other devices of that
ilk).
6. I/O tracking (handle buffered and asynchronous I/O properly).

The list of goals above is not exhaustive and it is also likely to
contain some not-so-nice-to-have features so your feedback would be
appreciated.

1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
groupings of processes and treat each group as a single scheduling
entity)

We obviously need this because our final goal is to be able to control
the IO generated by a Linux container. The good news is that we already
have the cgroups infrastructure so, regarding this problem, we would
just have to transform our I/O bandwidth controller into a cgroup
subsystem.
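
Just to make the "arbitrary groupings" point concrete, this is roughly how
such a cgroup subsystem would be used from userspace. It is only a sketch:
the "blockio" subsystem name, the /cgroups mount point and the group name
are made up, but the mkdir/tasks-file interface is the standard cgroups one:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        /* Assumes a cgroup hierarchy with a (hypothetical) "blockio"
         * subsystem already mounted at /cgroups. */
        const char *group = "/cgroups/container1";
        char path[64];
        FILE *f;

        /* Creating a directory creates a new group (scheduling entity). */
        if (mkdir(group, 0755) && errno != EEXIST) {
                perror("mkdir");
                return 1;
        }

        /* Moving a task into the group is a write of its PID to "tasks". */
        snprintf(path, sizeof(path), "%s/tasks", group);
        f = fopen(path, "w");
        if (!f) {
                perror("fopen tasks");
                return 1;
        }
        fprintf(f, "%d\n", (int)getpid());
        fclose(f);

        /* Any per-group knob the controller exposes (say, a hypothetical
         * "blockio.weight" file) would be written in the same way. */
        return 0;
}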

This seems to be the easiest part, but the current cgroups
infrastructure has some limitations when it comes to dealing with block
devices: the impossibility of creating/removing certain control structures
dynamically and the hardcoding of subsystems (i.e. resource controllers).
This makes it difficult to handle block devices that can be hotplugged
and go away at any time (this applies not only to USB storage but also
to some SATA and SCSI devices). To cope with this situation properly we
would need hotplug support in cgroups, but, as suggested before and
discussed in the past (see (0) below), there are some limitations.

Even in the non-hotplug case it would be nice if we could treat each
block I/O device as an independent resource, which means we could do
things like allocating I/O bandwidth on a per-device basis. As long as
performance is not compromised too much, adding some kind of basic
hotplug support to cgroups is probably worth it.

(0) http://lkml.org/lkml/2008/5/21/12

3. & 4. & 5. - I/O bandwidth shaping & General design aspects

The implementation of an I/O scheduling algorithm is to a certain extent
influenced by what we are trying to achieve in terms of I/O bandwidth
shaping, but, as discussed below, the required accuracy can determine
the layer where the I/O controller has to reside. Off the top of my
head, there are three basic operations we may want to perform (a small
numeric sketch follows the list below):
- I/O nice prioritization: ionice-like approach.
- Proportional bandwidth scheduling: each process/group of processes
has a weight that determines the share of bandwidth they receive.
- I/O limiting: set an upper limit to the bandwidth a group of tasks
can use.
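
As a back-of-the-envelope illustration of the difference between the last
two operations, the sketch below computes per-group throughput under both
policies. The group names, weights and numbers are made up; "available_bw"
stands for whatever the device actually delivers at a given moment:

#include <stdio.h>

struct iogroup {
        const char *name;
        unsigned int weight;    /* proportional share */
        double limit_mbps;      /* hard cap; <= 0 means "no limit" */
};

int main(void)
{
        /* Made-up groups; the weights do not have to sum to 100. */
        struct iogroup groups[] = {
                { "db",     60, -1.0 },
                { "backup", 30, 20.0 },
                { "misc",   10, -1.0 },
        };
        double available_bw = 80.0;     /* MB/s the device delivers right now */
        unsigned int total_weight = 0;
        size_t i, n = sizeof(groups) / sizeof(groups[0]);

        for (i = 0; i < n; i++)
                total_weight += groups[i].weight;

        for (i = 0; i < n; i++) {
                /* Proportional scheduling: a share of whatever is available. */
                double share = available_bw * groups[i].weight / total_weight;

                /* I/O limiting: clamp to an absolute ceiling if one is set. */
                if (groups[i].limit_mbps > 0 && share > groups[i].limit_mbps)
                        share = groups[i].limit_mbps;
                printf("%-7s gets %.1f MB/s\n", groups[i].name, share);
        }
        return 0;
}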

If we are pursuing an I/O prioritization model à la CFQ the temptation is
to implement it at the elevator layer or extend any of the existing I/O
schedulers.

There have been several proposals that extend either the CFQ scheduler
(see (1), (2) below) or the AS scheduler (see (3) below). The problem
with these controllers is that they are scheduler dependent, which means
that they become unusable when we change the scheduler or when we want
to control stacking devices which define their own make_request_fn
function (md and dm come to mind). It could be argued that the physical
devices controlled by a dm or md driver are likely to be fed by
traditional I/O schedulers such as CFQ, but these I/O schedulers would
be running independently from each other, each one controlling its own
device and ignoring the fact that they are part of a stacking device. This
lack of information at the elevator layer makes it pretty difficult to
obtain accurate results when using stacking devices. It seems that unless
we can make the elevator layer aware of the topology of stacking devices
(possibly by extending the elevator API?) elevator-based approaches do
not constitute a generic solution. From here onwards, for discussion
purposes, I will refer to this type of I/O bandwidth controllers as
elevator-based I/O controllers.

A simple way of solving the problems discussed in the previous paragraph
is to perform I/O control before the I/O actually enters the block layer
either at the pagecache level (when pages are dirtied) or at the entry
point to the generic block layer (generic_make_request()). Andrea's I/O
throttling patches stick to the former variant (see (4) below), while
Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) takes the latter
approach. The rationale is that by hooking into the source of I/O
requests we can perform I/O control in a topology-agnostic and
elevator-agnostic way. I will refer to this new type of I/O bandwidth
controller as a block layer I/O controller.
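
To make the "hook at the entry point" idea a bit more tangible, here is a
rough kernel-flavoured sketch (not dm-ioband's or Andrea's actual code) of
what a block layer I/O controller could do in front of
generic_make_request(); the io_cgroup type and the helpers are hypothetical
placeholders for whatever per-group accounting is chosen:

/* Sketch only: the io_cgroup type and the helpers declared below are
 * hypothetical; generic_make_request() and struct bio are the real thing. */
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/sched.h>
#include <linux/types.h>

struct io_cgroup;                                       /* hypothetical */
extern struct io_cgroup *io_cgroup_of(struct task_struct *tsk);
extern bool iocg_can_submit(struct io_cgroup *iocg, struct bio *bio);
extern void iocg_wait_for_budget(struct io_cgroup *iocg);
extern void iocg_charge(struct io_cgroup *iocg, unsigned int bytes);

static void iocg_submit_bio(struct bio *bio)
{
        struct io_cgroup *iocg = io_cgroup_of(current);

        /* If the group is over its budget, hold it back before letting it
         * issue more I/O (a real controller would sleep on a per-group
         * wait queue or queue the bio internally). */
        if (!iocg_can_submit(iocg, bio))
                iocg_wait_for_budget(iocg);

        /* Charge the group for the I/O it is about to issue ... */
        iocg_charge(iocg, bio->bi_size);

        /* ... and hand the bio to the normal block layer path. Elevators and
         * stacking drivers below see it unchanged, which is what makes this
         * approach topology- and elevator-agnostic. */
        generic_make_request(bio);
}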

By residing just above the generic block layer the implementation of a
block layer I/O controller becomes relatively easy, but by not taking
into account the characteristics of the underlying devices we might risk
underutilizing them. For this reason, in some cases it would probably
make sense to complement a generic I/O controller with an elevator-based
I/O controller, so that the maximum throughput can be squeezed from the
physical devices.

(1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
(2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
(3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
(4) Andrea Righi's I/O bandwidth controller (I/O throttling): http://thread.gmane.org/gmane.linux.kernel.containers/5975
(5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581

6.- I/O tracking

This is arguably the most important part, since to perform I/O control
we need to be able to determine where the I/O is coming from.

Reads are trivial because they are served in the context of the task
that generated the I/O. But most writes are performed by pdflush,
kswapd, and friends so performing I/O control just in the synchronous
I/O path would lead to large inaccuracy. To get this right we would need
to track ownership all the way up to the pagecache page. In other words,
it is necessary to track who is dirtying pages so that when they are
written to disk the right task is charged for that I/O.

Fortunately, such tracking of pages is one of the things the existing
memory resource controller is doing to control memory usage. This is a
clever observation which has a useful implication: if the rather
intertwined tracking and accounting parts of the memory resource
controller were split, the I/O controller could leverage the existing
infrastructure to track buffered and asynchronous I/O. This is exactly
what the bio-cgroup (see (6) below) patches set out to do.
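
A rough sketch of the idea (not the actual bio-cgroup code) might look as
follows; page_owner, lookup_page_owner() and current_io_cgroup_id() are
hypothetical names standing in for the split-out page tracking
infrastructure:

/* Sketch of the idea, not the actual bio-cgroup patches: page_owner,
 * lookup_page_owner() and current_io_cgroup_id() are hypothetical stand-ins
 * for the split-out page tracking infrastructure. */
#include <linux/mm.h>

struct page_owner {                     /* hypothetical, page_cgroup-like */
        unsigned short owner_id;        /* cgroup that dirtied the page */
};

extern struct page_owner *lookup_page_owner(struct page *page);
extern unsigned short current_io_cgroup_id(void);

/* Hooked from the paths that dirty pagecache pages (set_page_dirty & co.):
 * remember which group dirtied the page. */
static void record_page_owner(struct page *page)
{
        lookup_page_owner(page)->owner_id = current_io_cgroup_id();
}

/* Used when writeback later turns dirty pages into bios: the I/O gets
 * charged to the group that dirtied the page, not to pdflush or kswapd. */
static unsigned short page_io_owner(struct page *page)
{
        return lookup_page_owner(page)->owner_id;
}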

It is also possible to do without I/O tracking. For that we would need
to hook into the synchronous I/O path and every place in the kernel
where pages are dirtied (see (4) above for details). However, controlling
the rate at which a cgroup can generate dirty pages seems to be a task
that belongs in the memory controller, not the I/O controller. As Dave
and Paul suggested, it is probably better to delegate this to the memory
controller. In fact, it seems that Yamamoto-san is cooking some patches
that implement just that: dirty balancing for cgroups (see (7) for
details).

Another argument in favor of I/O tracking is that not only block layer
I/O controllers would benefit from it, but also the existing I/O
schedulers and the elevator-based I/O controllers proposed by
Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and myself
are working on this and hopefully will be sending patches soon).

(6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
(7) Yamamoto-san's dirty balancing patches: http://lwn.net/Articles/289237/

*** How to move on

As discussed before, it probably makes sense to have both a block layer
I/O controller and an elevator-based one, and they could certainly
coexist. In either case, both need I/O tracking capabilities, so I would
like to suggest the plan below to get things started:

- Improve the I/O tracking patches (see (6) above) until they are in
mergeable shape.
- Fix CFQ and AS to use the new I/O tracking functionality to show its
benefits. If the performance impact is acceptable this should suffice to
convince the respective maintainer and get the I/O tracking patches
merged.
- Implement a block layer resource controller. dm-ioband is a working
solution and feature-rich, but its dependency on the dm infrastructure is
likely to find opposition (the dm layer does not handle barriers
properly and the maximum size of I/O requests can be limited in some
cases). In such a case, we could either try to build a standalone
resource controller based on dm-ioband (which would probably hook into
generic_make_request) or try to come up with something new.
- If the I/O tracking patches make it into the kernel we could move on
and try to get the Cgroup extensions to CFQ and AS mentioned before (see
(1), (2), and (3) above for details) merged.
- Delegate the task of controlling the rate at which a task can
generate dirty pages to the memory controller.

This RFC is somewhat vague, but my feeling is that we should build some
consensus on the goals and basic design aspects before delving into
implementation details.

I would appreciate your comments and feedback.

- Fernando


Dave Hansen 08-06-2008 06:00 PM

RFC: I/O bandwidth controller (was Too many I/O controller patches)
 
On Wed, 2008-08-06 at 22:12 +0530, Balbir Singh wrote:
> Would you like to split up IO into read and write IO. We know that read can be
> very latency sensitive when compared to writes. Should we consider them
> separately in the RFC?

I'd just suggest doing what is simplest and can be done in the smallest
amount of code. As long as it is functional in some way and can be
extended to cover the end goal, I say keep it tiny.

> > Even in the non-hotplug case it would be nice if we could treat each
> > block I/O device as an independent resource, which means we could do
> > things like allocating I/O bandwidth on a per-device basis. As long as
> > performance is not compromised too much, adding some kind of basic
> > hotplug support to cgroups is probably worth it.
>
> Won't that get too complex. What if the user has thousands of disks with several
> partitions on each?

I think what Fernando is suggesting is that we *allow* each disk to be
treated separately, not that we actually separate them out. I agree
that with large disk count systems, it would get a bit nutty to deal
with I/O limits on each of them. It would also probably be nutty for
some dude with two disks in his system to have to set (or care about)
individual limits.

I guess I'm just arguing that we should allow pretty arbitrary grouping
of block devices into these resource pools.

-- Dave


"Naveen Gupta" 08-06-2008 07:37 PM

RFC: I/O bandwidth controller (was Too many I/O controller patches)
 
Fernando

Nice summary. My comments are inline.

-Naveen

2008/8/5 Fernando Luis Vázquez Cao <fernando@oss.ntt.co.jp>:
> On Mon, 2008-08-04 at 10:20 -0700, Dave Hansen wrote:
>> On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
>> > This series of patches of dm-ioband now includes "The bio tracking mechanism,"
>> > which has been posted individually to this mailing list.
>> > This makes it easy for anybody to control the I/O bandwidth even when
>> > the I/O is one of delayed-write requests.
>>
>> During the Containers mini-summit at OLS, it was mentioned that there
>> are at least *FOUR* of these I/O controllers floating around. Have you
>> talked to the other authors? (I've cc'd at least one of them).
>>
>> We obviously can't come to any kind of real consensus with people just
>> tossing the same patches back and forth.
>>
>> -- Dave
>
> Hi Dave,
>
> I have been tracking the memory controller patches for a while which
> spurred my interest in cgroups and prompted me to start working on I/O
> bandwidth controlling mechanisms. This year I have had several
> opportunities to discuss the design challenges of i/o controllers with
> the NEC and VALinux Japan teams (CCed), most recently last month during
> the Linux Foundation Japan Linux Symposium, where we took advantage of
> Andrew Morton's visit to Japan to do some brainstorming on this topic. I
> will try so summarize what was discussed there (and in the Linux Storage
> & Filesystem Workshop earlier this year) and propose a hopefully
> acceptable way to proceed and try to get things started.
>
> This RFC ended up being a bit longer than I had originally intended, but
> hopefully it will serve as the start of a fruitful discussion.
>
> As you pointed out, it seems that there is not much consensus building
> going on, but that does not mean there is a lack of interest. To get the
> ball rolling it is probably a good idea to clarify the state of things
> and try to establish what we are trying to accomplish.
>
> *** State of things in the mainstream kernel
> The kernel has had somewhat advanced I/O control capabilities for quite
> some time now: CFQ. But the current CFQ has some problems:
> - I/O priority can be set by PID, PGRP, or UID, but...
> - ...all the processes that fall within the same class/priority are
> scheduled together and arbitrary grouping are not possible.
> - Buffered I/O is not handled properly.
> - CFQ's IO priority is an attribute of a process that affects all
> devices it sends I/O requests to. In other words, with the current
> implementation it is not possible to assign per-device IO priorities to
> a task.
>
> *** Goals
> 1. Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity).
> 2. Being able to perform I/O bandwidth control independently on each
> device.
> 3. I/O bandwidth shaping.
> 4. Scheduler-independent I/O bandwidth control.
> 5. Usable with stacking devices (md, dm and other devices of that
> ilk).
> 6. I/O tracking (handle buffered and asynchronous I/O properly).
>
> The list of goals above is not exhaustive and it is also likely to
> contain some not-so-nice-to-have features so your feedback would be
> appreciated.
>
> 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> identity)
>
> We obviously need this because our final goal is to be able to control
> the IO generated by a Linux container. The good news is that we already
> have the cgroups infrastructure so, regarding this problem, we would
> just have to transform our I/O bandwidth controller into a cgroup
> subsystem.
>
> This seems to be the easiest part, but the current cgroups
> infrastructure has some limitations when it comes to dealing with block
> devices: impossibility of creating/removing certain control structures
> dynamically and hardcoding of subsystems (i.e. resource controllers).
> This makes it difficult to handle block devices that can be hotplugged
> and go away at any time (this applies not only to usb storage but also
> to some SATA and SCSI devices). To cope with this situation properly we
> would need hotplug support in cgroups, but, as suggested before and
> discussed in the past (see (0) below), there are some limitations.
>
> Even in the non-hotplug case it would be nice if we could treat each
> block I/O device as an independent resource, which means we could do
> things like allocating I/O bandwidth on a per-device basis. As long as
> performance is not compromised too much, adding some kind of basic
> hotplug support to cgroups is probably worth it.
>
> (0) http://lkml.org/lkml/2008/5/21/12
>
> 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
>
> The implementation of an I/O scheduling algorithm is to a certain extent
> influenced by what we are trying to achieve in terms of I/O bandwidth
> shaping, but, as discussed below, the required accuracy can determine
> the layer where the I/O controller has to reside. Off the top of my
> head, there are three basic operations we may want perform:
> - I/O nice prioritization: ionice-like approach.
> - Proportional bandwidth scheduling: each process/group of processes
> has a weight that determines the share of bandwidth they receive.
> - I/O limiting: set an upper limit to the bandwidth a group of tasks
> can use.

I/O limiting can be a special case of proportional bandwidth
scheduling. A process/process group can use its share of
bandwidth, and if there is spare bandwidth it can be allowed to use it.
If we want to restrict it absolutely, we add another flag which
specifies that the given proportion is exact and has an upper
bound.

Let's say the ideal b/w for a device is 100MB/s,

and process 1 is assigned a b/w share of 20%. When we say that the proportion
is strict, the b/w for process 1 will be 20% of the actual max b/w (which may
be less than 100MB/s), subject to a cap of 20MB/s.


>
> If we are pursuing an I/O prioritization model à la CFQ the temptation is
> to implement it at the elevator layer or extend any of the existing I/O
> schedulers.
>
> There have been several proposals that extend either the CFQ scheduler
> (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> with these controllers is that they are scheduler dependent, which means
> that they become unusable when we change the scheduler or when we want
> to control stacking devices which define their own make_request_fn
> function (md and dm come to mind). It could be argued that the physical
> devices controlled by a dm or md driver are likely to be fed by
> traditional I/O schedulers such as CFQ, but these I/O schedulers would
> be running independently from each other, each one controlling its own
> device ignoring the fact that they part of a stacking device. This lack
> of information at the elevator layer makes it pretty difficult to obtain
> accurate results when using stacking devices. It seems that unless we
> can make the elevator layer aware of the topology of stacking devices
> (possibly by extending the elevator API?) elevator-based approaches do
> not constitute a generic solution. Here onwards, for discussion
> purposes, I will refer to this type of I/O bandwidth controllers as
> elevator-based I/O controllers.

It can be argued that any scheduling decision with respect to I/O belongs to
the elevators. Until now they have been used to improve performance. But
with new requirements to isolate I/O based on process or cgroup, we
need to change the elevators.

If we add another layer of I/O scheduling (a block layer I/O controller)
above the elevators:
1) It builds another layer of I/O scheduling (bandwidth or priority).
2) This new layer can make I/O scheduling decisions which conflict
with the underlying elevator, e.g. if we decide to do b/w scheduling in
this new layer, there is no way a priority-based elevator could work
underneath it.

If a custom make_request_fn is defined (which means the said device is
not using an existing elevator), it could do its own scheduling
rather than asking the kernel to add another layer at the time of I/O
submission, since it has complete control of the I/O.

>
> A simple way of solving the problems discussed in the previous paragraph
> is to perform I/O control before the I/O actually enters the block layer
> either at the pagecache level (when pages are dirtied) or at the entry
> point to the generic block layer (generic_make_request()). Andrea's I/O
> throttling patches stick to the former variant (see (4) below) and
> Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) take the later
> approach. The rationale is that by hooking into the source of I/O
> requests we can perform I/O control in a topology-agnostic and
> elevator-agnostic way. I will refer to this new type of I/O bandwidth
> controller as block layer I/O controller.
>
> By residing just above the generic block layer the implementation of a
> block layer I/O controller becomes relatively easy, but by not taking
> into account the characteristics of the underlying devices we might risk
> underutilizing them. For this reason, in some cases it would probably
> make sense to complement a generic I/O controller with elevator-based
> I/O controller, so that the maximum throughput can be squeezed from the
> physical devices.
>
> (1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
> (2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
> (3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
> (4) Andrea Righi's i/o bandwidth controller (I/O throttling):http://thread.gmane.org/gmane.linux.kernel.containers/5975
> (5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
>
> 6.- I/O tracking
>
> This is arguably the most important part, since to perform I/O control
> we need to be able to determine where the I/O is coming from.
>
> Reads are trivial because they are served in the context of the task
> that generated the I/O. But most writes are performed by pdflush,
> kswapd, and friends so performing I/O control just in the synchronous
> I/O path would lead to large inaccuracy. To get this right we would need
> to track ownership all the way up to the pagecache page. In other words,
> it is necessary to track who is dirtying pages so that when they are
> written to disk the right task is charged for that I/O.
>
> Fortunately, such tracking of pages is one of the things the existing
> memory resource controller is doing to control memory usage. This is a
> clever observation which has a useful implication: if the rather
> imbricated tracking and accounting parts of the memory resource
> controller were split the I/O controller could leverage the existing
> infrastructure to track buffered and asynchronous I/O. This is exactly
> what the bio-cgroup (see (6) below) patches set out to do.
>
> It is also possible to do without I/O tracking. For that we would need
> to hook into the synchronous I/O path and every place in the kernel
> where pages are dirtied (see (4) above for details). However controlling
> the rate at which a cgroup can generate dirty pages seems to be a task
> that belongs in the memory controller not the I/O controller. As Dave
> and Paul suggested its probably better to delegate this to the memory
> controller. In fact, it seems that Yamamoto-san is cooking some patches
> that implement just that: dirty balancing for cgroups (see (7) for
> details).
>
> Another argument in favor of I/O tracking is that not only block layer
> I/O controllers would benefit from it, but also the existing I/O
> schedulers and the elevator-based I/O controllers proposed by
> Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and myself
> are working on this and hopefully will be sending patches soon).
>
> (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> (7) Yamamoto-san dirty balancing patches: http://lwn.net/Articles/289237/
>
> *** How to move on
>
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and a elevator-based one, and they could certainly
> cohabitate. As discussed before, all of them need I/O tracking
> capabilities so I would like to suggest the plan below to get things
> started:
>
> - Improve the I/O tracking patches (see (6) above) until they are in
> mergeable shape.
> - Fix CFQ and AS to use the new I/O tracking functionality to show its
> benefits. If the performance impact is acceptable this should suffice to
> convince the respective maintainer and get the I/O tracking patches
> merged.
> - Implement a block layer resource controller. dm-ioband is a working
> solution and feature rich but its dependency on the dm infrastructure is
> likely to find opposition (the dm layer does not handle barriers
> properly and the maximum size of I/O requests can be limited in some
> cases). In such a case, we could either try to build a standalone
> resource controller based on dm-ioband (which would probably hook into
> generic_make_request) or try to come up with something new.
> - If the I/O tracking patches make it into the kernel we could move on
> and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> (1), (2), and (3) above for details) merged.
> - Delegate the task of controlling the rate at which a task can
> generate dirty pages to the memory controller.
>
> This RFC is somewhat vague but my feeling is that we build some
> consensus on the goals and basic design aspects before delving into
> implementation details.
>
> I would appreciate your comments and feedback.
>
> - Fernando
>
>


Fernando Luis Vázquez Cao 08-07-2008 02:44 AM

RFC: I/O bandwidth controller (was Too many I/O controller patches)
 
On Wed, 2008-08-06 at 22:12 +0530, Balbir Singh wrote:
> > 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> > groupings of processes and treat each group as a single scheduling
> > identity)
> >
> > We obviously need this because our final goal is to be able to control
> > the IO generated by a Linux container. The good news is that we already
> > have the cgroups infrastructure so, regarding this problem, we would
> > just have to transform our I/O bandwidth controller into a cgroup
> > subsystem.
> >
> > This seems to be the easiest part, but the current cgroups
> > infrastructure has some limitations when it comes to dealing with block
> > devices: impossibility of creating/removing certain control structures
> > dynamically and hardcoding of subsystems (i.e. resource controllers).
> > This makes it difficult to handle block devices that can be hotplugged
> > and go away at any time (this applies not only to usb storage but also
> > to some SATA and SCSI devices). To cope with this situation properly we
> > would need hotplug support in cgroups, but, as suggested before and
> > discussed in the past (see (0) below), there are some limitations.
> >
> > Even in the non-hotplug case it would be nice if we could treat each
> > block I/O device as an independent resource, which means we could do
> > things like allocating I/O bandwidth on a per-device basis. As long as
> > performance is not compromised too much, adding some kind of basic
> > hotplug support to cgroups is probably worth it.
> >
>
> Won't that get too complex. What if the user has thousands of disks with several
> partitions on each?
As Dave pointed out, I just think that we should allow each disk to be
treated separately. To avoid the administration nightmare you mention,
adding block device grouping capabilities should suffice to solve most
of the issues.

> > 6.- I/O tracking
> >
> > This is arguably the most important part, since to perform I/O control
> > we need to be able to determine where the I/O is coming from.
> >
> > Reads are trivial because they are served in the context of the task
> > that generated the I/O. But most writes are performed by pdflush,
> > kswapd, and friends so performing I/O control just in the synchronous
> > I/O path would lead to large inaccuracy. To get this right we would need
> > to track ownership all the way up to the pagecache page. In other words,
> > it is necessary to track who is dirtying pages so that when they are
> > written to disk the right task is charged for that I/O.
> >
> > Fortunately, such tracking of pages is one of the things the existing
> > memory resource controller is doing to control memory usage. This is a
> > clever observation which has a useful implication: if the rather
> > imbricated tracking and accounting parts of the memory resource
> > controller were split the I/O controller could leverage the existing
> > infrastructure to track buffered and asynchronous I/O. This is exactly
> > what the bio-cgroup (see (6) below) patches set out to do.
> >
>
> Are you suggesting that the IO and memory controller should always be bound
> together?
That is a really good question. The I/O tracking patches split the
memory controller into two functional parts: (1) page tracking and (2)
memory accounting/cgroup policy enforcement. By doing so the memory
controller-specific code can be separated from the rest, which,
admittedly, will not benefit the memory controller a great deal, but,
hopefully, we get cleaner code that is easier to maintain.

The important thing, though, is that with this separation the page
tracking bits can be easily reused by any subsystem that needs to keep
track of pages, and the I/O controller is certainly one such candidate.
Synchronous I/O is easy to deal with because everything is done in the
context of the task that generated the I/O, but buffered I/O and
asynchronous I/O are problematic. However, with the observation that the
owner of an I/O request happens to be the owner of the pages the I/O
buffers of that request reside in, it becomes clear that pdflush and
friends could use that information to determine who the originator of
the I/O is and charge the I/O accordingly.

Going back to your question, with the current I/O tracking patches the I/O
controller would be bound to the page tracking functionality of cgroups
(page_cgroup), not the memory controller. We would not even need to
compile the memory controller. The dependency on cgroups would still be
there, though.

As an aside, I guess that with some effort we could get rid of this
dependency by providing some basic tracking capabilities even when the
cgroups infrastructure is not being used. By doing so traditional I/O
schedulers such as CFQ could benefit from proper I/O tracking
capabilities without using cgroups. Of course, if the kernel has cgroups
support compiled in, the cgroups-based I/O tracking would be used instead
(this idea was inspired by CFS' group scheduling, which works both with and
without cgroups support). I am currently trying to implement this.

> > (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90

> > *** How to move on
> >
> > As discussed before, it probably makes sense to have both a block layer
> > I/O controller and a elevator-based one, and they could certainly
> > cohabitate. As discussed before, all of them need I/O tracking
> > capabilities so I would like to suggest the plan below to get things
> > started:
> >
> > - Improve the I/O tracking patches (see (6) above) until they are in
> > mergeable shape.
>
> Yes, I agree with this step as being the first step. May be extending the
> current task I/O accounting to cgroups could be done as a part of this.
Yes, makes sense.

> > - Fix CFQ and AS to use the new I/O tracking functionality to show its
> > benefits. If the performance impact is acceptable this should suffice to
> > convince the respective maintainer and get the I/O tracking patches
> > merged.
> > - Implement a block layer resource controller. dm-ioband is a working
> > solution and feature rich but its dependency on the dm infrastructure is
> > likely to find opposition (the dm layer does not handle barriers
> > properly and the maximum size of I/O requests can be limited in some
> > cases). In such a case, we could either try to build a standalone
> > resource controller based on dm-ioband (which would probably hook into
> > generic_make_request) or try to come up with something new.
> > - If the I/O tracking patches make it into the kernel we could move on
> > and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> > (1), (2), and (3) above for details) merged.
> > - Delegate the task of controlling the rate at which a task can
> > generate dirty pages to the memory controller.
> >
> > This RFC is somewhat vague but my feeling is that we build some
> > consensus on the goals and basic design aspects before delving into
> > implementation details.
> >
> > I would appreciate your comments and feedback.
>
> Very nice summary
Thank you!


Fernando Luis Vázquez Cao 08-07-2008 03:01 AM

RFC: I/O bandwidth controller (was Too many I/O controller patches)
 
On Wed, 2008-08-06 at 22:12 +0530, Balbir Singh wrote:
> > *** Goals
> > 1. Cgroups-aware I/O scheduling (being able to define arbitrary
> > groupings of processes and treat each group as a single scheduling
> > entity).
> > 2. Being able to perform I/O bandwidth control independently on each
> > device.
> > 3. I/O bandwidth shaping.
> > 4. Scheduler-independent I/O bandwidth control.
> > 5. Usable with stacking devices (md, dm and other devices of that
> > ilk).
> > 6. I/O tracking (handle buffered and asynchronous I/O properly).
> >
> > The list of goals above is not exhaustive and it is also likely to
> > contain some not-so-nice-to-have features so your feedback would be
> > appreciated.
> >
>
> Would you like to split up IO into read and write IO. We know that read can be
> very latency sensitive when compared to writes. Should we consider them
> separately in the RFC?
Oops, I somehow ended up leaving your first question unanswered. Sorry.

I do not think we should consider them separately, as long as there is a
proper IO tracking infrastructure in place. As you mentioned, reads can
be very latency sensitive, but the read case could be treated as a
special case by the IO controller/IO tracking subsystem. There certainly
are optimization opportunities. For example, in the synchronous I/O path
we could mark bios with the iocontext of the current task, because it will
happen to be the originator of that IO. By effectively caching the
ownership information in the bio we can avoid all the accesses to struct
page, page_cgroup, etc., and reads would definitely benefit from that.
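
A minimal sketch of that fast path, assuming the tracking patch adds a
bi_io_context field to struct bio (no such field exists in mainline):

/* Sketch only: assumes the tracking patch adds a bi_io_context field to
 * struct bio; no such field exists in mainline. */
#include <linux/bio.h>
#include <linux/iocontext.h>
#include <linux/sched.h>

/* For synchronous I/O the submitting task is the originator, so its
 * io_context can be cached in the bio at submission time ... */
static inline void bio_mark_sync_owner(struct bio *bio)
{
        bio->bi_io_context = current->io_context;       /* hypothetical field */
}

/* ... and read back when the I/O is charged, without ever touching
 * struct page or page_cgroup, which keeps the read path cheap. */
static inline struct io_context *bio_sync_owner(struct bio *bio)
{
        return bio->bi_io_context;                      /* hypothetical field */
}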


Fernando Luis Vázquez Cao 08-07-2008 01:17 PM

RFC: I/O bandwidth controller (was Too many I/O controller patches)
 
Hi Naveen,

On Wed, 2008-08-06 at 12:37 -0700, Naveen Gupta wrote:
> > 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
> >
> > The implementation of an I/O scheduling algorithm is to a certain extent
> > influenced by what we are trying to achieve in terms of I/O bandwidth
> > shaping, but, as discussed below, the required accuracy can determine
> > the layer where the I/O controller has to reside. Off the top of my
> > head, there are three basic operations we may want perform:
> > - I/O nice prioritization: ionice-like approach.
> > - Proportional bandwidth scheduling: each process/group of processes
> > has a weight that determines the share of bandwidth they receive.
> > - I/O limiting: set an upper limit to the bandwidth a group of tasks
> > can use.
>
> I/O limiting can be a special case of proportional bandwidth
> scheduling. A process/process group can use use it's share of
> bandwidth and if there is spare bandwidth it be allowed to use it. And
> if we want to absolutely restrict it we add another flag which
> specifies that the specified proportion is exact and has an upper
> bound.
>
> Let's say the ideal b/w for a device is 100MB/s
>
> And process 1 is assigned b/w of 20%. When we say that the proportion
> is strict, the b/w for process 1 will be 20% of the max b/w (which may
> be less than 100MB/s) subject to a max of 20MB/s.
I essentially agree with you. The nice thing about proportional
bandwidth scheduling is that we get bandwidth guarantees when there is
contention for the block device, but still get the benefits of
statistical multiplexing in the non-contended case. With strict IO
limiting we risk underusing the block devices.

> > If we are pursuing an I/O prioritization model à la CFQ the temptation is
> > to implement it at the elevator layer or extend any of the existing I/O
> > schedulers.
> >
> > There have been several proposals that extend either the CFQ scheduler
> > (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> > with these controllers is that they are scheduler dependent, which means
> > that they become unusable when we change the scheduler or when we want
> > to control stacking devices which define their own make_request_fn
> > function (md and dm come to mind). It could be argued that the physical
> > devices controlled by a dm or md driver are likely to be fed by
> > traditional I/O schedulers such as CFQ, but these I/O schedulers would
> > be running independently from each other, each one controlling its own
> > device ignoring the fact that they part of a stacking device. This lack
> > of information at the elevator layer makes it pretty difficult to obtain
> > accurate results when using stacking devices. It seems that unless we
> > can make the elevator layer aware of the topology of stacking devices
> > (possibly by extending the elevator API?) elevator-based approaches do
> > not constitute a generic solution. Here onwards, for discussion
> > purposes, I will refer to this type of I/O bandwidth controllers as
> > elevator-based I/O controllers.
>
> It can be argued that any scheduling decision wrt to i/o belongs to
> elevators. Till now they have been used to improve performance. But
> with new requirements to isolate i/o based on process or cgroup, we
> need to change the elevators.
I have the impression there is a tendency to conflate two different
issues when discussing I/O schedulers and resource controllers, so let
me elaborate on this point.

On the one hand, we have the problem of feeding physical devices with IO
requests in such a way that we squeeze the maximum performance out of
them. Of course in some cases we may want to prioritize responsiveness
over throughput. In either case the kernel has to perform the same basic
operations: merging and dispatching IO requests. There is no question that
this is the elevator's job and that the elevator should take into account
the physical characteristics of the device.

On the other hand, there is the problem of sharing an IO resource, i.e.
block device, between multiple tasks or groups of tasks. There are many
ways of sharing an IO resource depending on what we are trying to
accomplish: proportional bandwidth scheduling, priority-based
scheduling, etc. But to implement these sharing algorithms the kernel has
to determine the task whose IO will be submitted. In a sense, we are
scheduling tasks (and groups of tasks) not IO requests (which has much
in common with CPU scheduling). Besides, the sharing problem is not
directly related to the characteristics of the underlying device, which
means it does not need to be implemented at the elevator layer.

Traditional elevators limit themselves to scheduling IO requests to disk
with no regard to where they came from. However, new IO schedulers such as
CFQ combine this with IO prioritization capabilities. This means that
the elevator decides the application whose IO will be dispatched next.
The problem is that at this layer there is not enough information to
make such decisions in an accurate way because, as mentioned in the
RFC, the elevator has no way to know the block IO topology. The
implication of this is that the elevator does not know the impact a
particular scheduling decision will have on the IO throughput seen by
applications, which is what users care about.

For all these reasons, I think the elevator should take care of
optimizing the last stretch of the IO path (generic block layer -> block
device) for performance/responsiveness, and leave the job of ensuring
that each task is guaranteed a fair share of the kernel's IO resources
to the upper layers (for example a block layer resource controller).

I recognize that in some cases global performance could be improved if
the block layer had access to information from the elevator, and that is
why I mentioned in the RFC that in some cases it might make sense to
combine a block layer resource controller and an elevator layer one (we
would just need to figure out a way for the two to communicate with each
other and work well in tandem).

> If we add another layer of i/o scheduling (block layer I/O controller)
> above elevators
> 1) It builds another layer of i/o scheduling (bandwidth or priority)
As I mentioned before we are trying to achieve two things: making the
best use of block devices, and sharing those IO resources between tasks
or groups of tasks. There are two possible approaches here: implement
everything in the elevator or move the sharing bits somewhere above the
elevator layer. In either case we have to carry out the same tasks, so
the impact of delegating part of the work to a new layer should not be
that big and, hopefully, maintainability will improve.

> 2) This new layer can have decisions for i/o scheduling which conflict
> with underlying elevator. e.g. If we decide to do b/w scheduling in
> this new layer, there is no way a priority based elevator could work
> underneath it.
The priority system could be implemented above the elevator layer in the
block layer resource controller, which means that the elevator would
only have to worry about scheduling the requests it receives from the
block layer and dispatching them to disk in the best possible way.

An alternative would be using a block layer resource controller and an
elevator-based resource controller in tandem.

> If a custom make_request_fn is defined (which means the said device is
> not using existing elevator),
Please note that each of the block devices that constitute a stacking
device could have its own elevator.

> it could build it's own scheduling
> rather than asking kernel to add another layer at the time of i/o
> submission. Since it has complete control of i/o.
I think that is something we should avoid. The IO scheduling behavior
that the user sees should not depend on the topology of the system. We
certainly do not want to reimplement the same scheduling algorithm for
every RAID driver. I am of the opinion that whatever IO scheduling
algorithm we choose should be implemented just once and usable under any
IO configuration.


Fernando Luis Vázquez Cao 08-07-2008 01:59 PM

RFC: I/O bandwidth controller (was Too many I/O controller patches)
 
Hi Andrea!

On Thu, 2008-08-07 at 09:46 +0200, Andrea Righi wrote:
> Fernando Luis Vázquez Cao wrote:
> > This RFC ended up being a bit longer than I had originally intended, but
> > hopefully it will serve as the start of a fruitful discussion.
>
> Thanks for posting this detailed RFC! A few comments below.
>
> > As you pointed out, it seems that there is not much consensus building
> > going on, but that does not mean there is a lack of interest. To get the
> > ball rolling it is probably a good idea to clarify the state of things
> > and try to establish what we are trying to accomplish.
> >
> > *** State of things in the mainstream kernel
> > The kernel has had somewhat advanced I/O control capabilities for quite
> > some time now: CFQ. But the current CFQ has some problems:
> > - I/O priority can be set by PID, PGRP, or UID, but...
> > - ...all the processes that fall within the same class/priority are
> > scheduled together and arbitrary grouping are not possible.
> > - Buffered I/O is not handled properly.
> > - CFQ's IO priority is an attribute of a process that affects all
> > devices it sends I/O requests to. In other words, with the current
> > implementation it is not possible to assign per-device IO priorities to
> > a task.
> >
> > *** Goals
> > 1. Cgroups-aware I/O scheduling (being able to define arbitrary
> > groupings of processes and treat each group as a single scheduling
> > entity).
> > 2. Being able to perform I/O bandwidth control independently on each
> > device.
> > 3. I/O bandwidth shaping.
> > 4. Scheduler-independent I/O bandwidth control.
> > 5. Usable with stacking devices (md, dm and other devices of that
> > ilk).
> > 6. I/O tracking (handle buffered and asynchronous I/O properly).
>
> The same above also for IO operations/sec (bandwidth intended not only
> in terms of bytes/sec), plus:
>
> 7. Optimal bandwidth usage: allow to exceed the IO limits to take
> advantage of free/unused IO resources (i.e. allow "bursts" when the
> whole physical bandwidth for a block device is not fully used and then
> "throttle" again when IO from unlimited cgroups comes into place)
>
> 8. "fair throttling": avoid to throttle always the same task within a
> cgroup, but try to distribute the throttling among all the tasks
> belonging to the throttle cgroup

Thank you for the ideas!

By the way, point "3." above (I/O bandwidth shaping) refers to IO
scheduling algorithms in general. When I wrote the RFC I thought that
once we have the IO tracking and accounting mechanisms in place, choosing
and implementing an algorithm (fair throttling, proportional bandwidth
scheduling, etc.) would be easy in comparison, which is why a
list was not included.

Once I get more feedback from all of you I will resend a more detailed
RFC that will include your suggestions.

> > 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> > groupings of processes and treat each group as a single scheduling
> > identity)
> >
> > We obviously need this because our final goal is to be able to control
> > the IO generated by a Linux container. The good news is that we already
> > have the cgroups infrastructure so, regarding this problem, we would
> > just have to transform our I/O bandwidth controller into a cgroup
> > subsystem.
> >
> > This seems to be the easiest part, but the current cgroups
> > infrastructure has some limitations when it comes to dealing with block
> > devices: impossibility of creating/removing certain control structures
> > dynamically and hardcoding of subsystems (i.e. resource controllers).
> > This makes it difficult to handle block devices that can be hotplugged
> > and go away at any time (this applies not only to usb storage but also
> > to some SATA and SCSI devices). To cope with this situation properly we
> > would need hotplug support in cgroups, but, as suggested before and
> > discussed in the past (see (0) below), there are some limitations.
> >
> > Even in the non-hotplug case it would be nice if we could treat each
> > block I/O device as an independent resource, which means we could do
> > things like allocating I/O bandwidth on a per-device basis. As long as
> > performance is not compromised too much, adding some kind of basic
> > hotplug support to cgroups is probably worth it.
> >
> > (0) http://lkml.org/lkml/2008/5/21/12
>
> What about using major,minor numbers to identify each device and account
> IO statistics? If a device is unplugged we could reset IO statistics
> and/or remove IO limitations for that device from userspace (i.e. by a
> daemon), but plugging/unplugging the device would not be blocked/affected
> in any case. Or am I oversimplifying the problem?
If a resource we want to control (a block device in this case) is
hot-plugged or unplugged, the corresponding cgroup-related structures
inside the kernel need to be allocated or freed dynamically, respectively.
The problem is that this is not always possible. For example, with the
current implementation of cgroups it is not possible to treat each block
device as a different cgroup subsystem/resource controller, because
subsystems are created at compile time.
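
That said, identifying a device by its major:minor pair from userspace, as
Andrea suggests, is straightforward; a minimal sketch (the device path is
just an example):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
        const char *dev = (argc > 1) ? argv[1] : "/dev/sda";
        struct stat st;

        if (stat(dev, &st)) {
                perror(dev);
                return 1;
        }
        if (!S_ISBLK(st.st_mode)) {
                fprintf(stderr, "%s is not a block device\n", dev);
                return 1;
        }
        /* The major:minor pair is a natural key for per-device IO limits
         * and statistics, independent of how the device node is named. */
        printf("%s = %u:%u\n", dev, major(st.st_rdev), minor(st.st_rdev));
        return 0;
}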

> > 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
> >
> > The implementation of an I/O scheduling algorithm is to a certain extent
> > influenced by what we are trying to achieve in terms of I/O bandwidth
> > shaping, but, as discussed below, the required accuracy can determine
> > the layer where the I/O controller has to reside. Off the top of my
> > head, there are three basic operations we may want perform:
> > - I/O nice prioritization: ionice-like approach.
> > - Proportional bandwidth scheduling: each process/group of processes
> > has a weight that determines the share of bandwidth they receive.
> > - I/O limiting: set an upper limit to the bandwidth a group of tasks
> > can use.
>
> Using a deadline-based IO scheduler could be an interesting path to be
> explored as well, IMHO, to try to guarantee per-cgroup minimum bandwidth
> requirements.
Please note that the only thing we can do is to guarantee a minimum
bandwidth requirement when there is contention for an IO resource, which
is precisely what a proportional bandwidth scheduler does. Am I missing
something?


"Naveen Gupta" 08-11-2008 06:18 PM

RFC: I/O bandwidth controller (was Too many I/O controller patches)
 
Hello Fernando


2008/8/7 Fernando Luis Vázquez Cao <fernando@oss.ntt.co.jp>:
> Hi Naveen,
>
> On Wed, 2008-08-06 at 12:37 -0700, Naveen Gupta wrote:
>> > 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
>> >
>> > The implementation of an I/O scheduling algorithm is to a certain extent
>> > influenced by what we are trying to achieve in terms of I/O bandwidth
>> > shaping, but, as discussed below, the required accuracy can determine
>> > the layer where the I/O controller has to reside. Off the top of my
>> > head, there are three basic operations we may want perform:
>> > - I/O nice prioritization: ionice-like approach.
>> > - Proportional bandwidth scheduling: each process/group of processes
>> > has a weight that determines the share of bandwidth they receive.
>> > - I/O limiting: set an upper limit to the bandwidth a group of tasks
>> > can use.
>>
>> I/O limiting can be a special case of proportional bandwidth
>> scheduling. A process/process group can use use it's share of
>> bandwidth and if there is spare bandwidth it be allowed to use it. And
>> if we want to absolutely restrict it we add another flag which
>> specifies that the specified proportion is exact and has an upper
>> bound.
>>
>> Let's say the ideal b/w for a device is 100MB/s
>>
>> And process 1 is assigned b/w of 20%. When we say that the proportion
>> is strict, the b/w for process 1 will be 20% of the max b/w (which may
>> be less than 100MB/s) subject to a max of 20MB/s.
> I essentially agree with you. The nice thing about proportional
> bandwidth scheduling is that we get bandwidth guarantees when there is
> contention for the block device, but still get the benefits of
> statistical multiplexing in the non-contended case. With strict IO
> limiting we risk underusing the block devices.
>
>> > If we are pursuing an I/O prioritization model à la CFQ the temptation is
>> > to implement it at the elevator layer or extend any of the existing I/O
>> > schedulers.
>> >
>> > There have been several proposals that extend either the CFQ scheduler
>> > (see (1), (2) below) or the AS scheduler (see (3) below). The problem
>> > with these controllers is that they are scheduler dependent, which means
>> > that they become unusable when we change the scheduler or when we want
>> > to control stacking devices which define their own make_request_fn
>> > function (md and dm come to mind). It could be argued that the physical
>> > devices controlled by a dm or md driver are likely to be fed by
>> > traditional I/O schedulers such as CFQ, but these I/O schedulers would
>> > be running independently from each other, each one controlling its own
>> > device ignoring the fact that they part of a stacking device. This lack
>> > of information at the elevator layer makes it pretty difficult to obtain
>> > accurate results when using stacking devices. It seems that unless we
>> > can make the elevator layer aware of the topology of stacking devices
>> > (possibly by extending the elevator API?) elevator-based approaches do
>> > not constitute a generic solution. Here onwards, for discussion
>> > purposes, I will refer to this type of I/O bandwidth controllers as
>> > elevator-based I/O controllers.
>>
>> It can be argued that any scheduling decision wrt to i/o belongs to
>> elevators. Till now they have been used to improve performance. But
>> with new requirements to isolate i/o based on process or cgroup, we
>> need to change the elevators.
> I have the impression there is a tendency to conflate two different
> issues when discussing I/O schedulers and resource controllers, so let
> me elaborate on this point.
>
> On the one hand, we have the problem of feeding physical devices with IO
> requests in such a way that we squeeze the maximum performance out of
> them. Of course in some cases we may want to prioritize responsiveness
> over throughput. In either case the kernel has to perform the same basic
> operations: merging and dispatching IO requests. There is no discussion
> this is the elevator's job and the elevator should take into account the
> physical characteristics of the device.
>
> On the other hand, there is the problem of sharing an IO resource, i.e.
> block device, between multiple tasks or groups of tasks. There are many
> ways of sharing an IO resource depending on what we are trying to
> accomplish: proportional bandwidth scheduling, priority-based
> scheduling, etc. But to implement this sharing algorithms the kernel has
> to determine the task whose IO will be submitted. In a sense, we are
> scheduling tasks (and groups of tasks) not IO requests (which has much
> in common with CPU scheduling). Besides, the sharing problem is not
> directly related to the characteristics of the underlying device, which
> means it does not need to be implemented at the elevator layer.

What if we pass the task-specific information to the elevator? We do
this for CFQ (where we pass the priority), and if we need any
additional information to be passed we could add that in a similar
manner.

I really liked your initial suggestion where step 1 would be to add the
I/O tracking patches, and then use this in CFQ and AS to do resource
sharing. If we see any shortcomings with this approach, let's see
what the best place is to solve the remaining problems.


>
> Traditional elevators limit themselves to schedule IO requests to disk
> with no regard to where it came from. However, new IO schedulers such as
> CFQ combine this with IO prioritization capabilities. This means that
> the elevator decides the application whose IO will be dispatched next.
> The problem is that at this layer there is not enough information to
> make such decisions in an accurate way, because, as mentioned in the
> RFC, the elevator has not way to know the block IO topology. The
> implication of this is that the elevator does not know the impact a
> particular scheduling decision will make in the IO throughput seen by
> applications, which is what users care about.

Is it possible to send the topology information to the elevators? Then
they could make global as well as local decisions.

>
> For all these reasons, I think the elevator should take care of
> optimizing the last stretch of the IO path (generic block layer -> block
> device) for performance/responsiveness, and leave the job of ensuring
> that each task is guaranteed a fair share of the kernel's IO resources
> to the upper layers (for example a block layer resource controller).
>
> I recognize that in some cases global performance could be improved if
> the block layer had access to information from the elevator, and that is
> why I mentioned in the RFC that in some cases it might make sense to
> combine a block layer resource controller and a elevator layer one (we
> just would need to figure out a way for the to communicate with each
> other and work well in tandem).
>
>> If we add another layer of i/o scheduling (block layer I/O controller)
>> above elevators
>> 1) It builds another layer of i/o scheduling (bandwidth or priority)
> As I mentioned before we are trying to achieve two things: making the
> best use of block devices, and sharing those IO resources between tasks
> or groups of tasks. There are two possible approaches here: implement
> everything in the elevator or move the sharing bits somewhere above the
> elevator layer. In either case we have to carry out the same tasks so
> the impact of delegating part of the work to a new layer should not be
> that big, and, hopefully, will improve maintainability.
>
>> 2) This new layer can have decisions for i/o scheduling which conflict
>> with underlying elevator. e.g. If we decide to do b/w scheduling in
>> this new layer, there is no way a priority based elevator could work
>> underneath it.
> The priority system could be implemented above the elevator layer in the
> block layer resource controller, which means that the elevator would
> only have to worry about scheduling the requests it receives from the
> block layer and dispatching them to disk in the best possible way.
>
> An alternative would be using a block layer resource controller and a
> elavator-based resource controller in tandem.
>
>> If a custom make_request_fn is defined (which means the said device is
>> not using existing elevator),
> Please note that each of the block devices that constitute a stacking
> device could have its own elevator.

Another possible approach, if the top layer cannot pass topology info
to the underlying block device elevators, is to use FIFO for the
underlying block devices, effectively disabling them. The top layer
will make its scheduling decisions in a custom __make_request and the
layers below will just forward the requests, so we can easily avoid any
conflict.
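
A minimal illustration of the "disable the lower elevators" idea, using the
standard sysfs knob (the member device name is just an example):

#include <stdio.h>

int main(void)
{
        /* The member device name is only an example. */
        const char *knob = "/sys/block/sdb/queue/scheduler";
        FILE *f = fopen(knob, "w");

        if (!f) {
                perror(knob);
                return 1;
        }
        /* "noop" is the FIFO-like elevator: it only merges and forwards,
         * leaving the real scheduling decisions to the layer above. */
        fprintf(f, "noop\n");
        fclose(f);
        return 0;
}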

>
>> it could build it's own scheduling
>> rather than asking kernel to add another layer at the time of i/o
>> submission. Since it has complete control of i/o.
> I think that is something we should avoid. The IO scheduling behavior
> that the user sees should not depend on the topology of the system. We
> certainly do not want to reimplement the same scheduling algorithm for
> every RAID driver. I am of the opinion that whatever IO scheduling
> algorithm we choose should be implemented just once and usable under any
> IO configuration.
>
I agree that we shouldn't be reinventing things for every RAID driver.
We could have a generic algorithm which everyone plugs into. If
that is not possible, we always have the option to create one in a
custom __make_request.

>


-Naveen


