Old 08-24-2012, 06:14 PM
Jim Ramsay
 
reworked dm-switch target

On Tue, Aug 21, 2012 at 09:02:35PM -0400, Mikulas Patocka wrote:
> On Tue, 21 Aug 2012, Jim Ramsay wrote:
> > On Mon, Aug 20, 2012 at 03:20:42PM -0400, Mikulas Patocka wrote:
> > > On Fri, 17 Aug 2012, Jim_Ramsay@DELL.com wrote:
> > > > 1) Uploading large page tables
<snip>
> I converted the format to use hexadecimal numbers (they are faster to
> produce and faster to parse) and made an option to omit the page number
> (in this case, the previous page plus one is used) - and it takes 0.05s
> to load a table with one million entries on 2.3GHz Opteron.
>
> The table is loaded with 67 dm message calls, each having 45000 bytes
> (the number 45000 was experimentally found to be near the optimum).
>
> So I don't think there are performance problems with this.
>
> I'll send you the program that updates the table with messages.

Thanks for this change, and for doing that performance test.

We are interested in the relative performance between the dm message
interface and the netlink interface when uploading a million page table
entries. If this dm message-based format is close enough, it would
certainly be an acceptable replacement for the netlink mechanism.

> > > > Perhaps we can work with you on designing alternate non-netlink mechanism
> > > > to achieve the same goal... A sysfs file per DM device for userland
> > > > processes to do direct I/O with? Base64-encoding larger chunks of the
> > > > binary page tables and passing those values through 'dmsetup message'?
> > >
> > > As I said, you don't have to upload the whole table with one message ...
> > > or if you really need to update the whole table at once, explain why.
> >
> > At the very least, we would need to update the whole page table in the
> > following scenarios:
> >
> > 1) When we first learn the geometry of the volume
> >
> > 2) When the volume layout changes significantly (for example, if it was
> > previously represented by 2 devices and is then later moved onto 3
> > devices, or the underlying LUN is resized)
> >
> > 3) When the protocol used to fetch the data can fetch segments of the
> > page table in a dense binary format, it is considerably more work
> > for a userland process to keep its own persistent copy of the
> > page table, compare a new version with the old version, calculate
> > the differences, and send only those differences. It is much
> > simpler to have a binary conduit to upload the entire table at
> > once, provided it does not occur too frequently.
>
> But you don't have to upload the table at once - you can upload the table
> incrementally with several dm messages.

By "all at once" I was talking about the scenarios when you need to push
all 1000000 entries to the kernel driver. The netlink implementation
also sends the data in chunks.

The question as to whether we should do this in one message or multiple
messages (and how many and how large they are) is better answered by
checking relative performance between this dm message code and our
existing netlink code.

> > Furthermore, if a userland process already has an internal binary
> > representation of a page map, what is the value in converting this into
> > a complicated human-readable ascii representation then having the kernel
> > do the opposite de-conversion when it receives the data?
>
> The reason is simplicity - the dm message code is noticeably smaller than
> the netlink code. It is also less bug-prone because no structures are
> allocated or freed there.

I do like the simplicity of the dm message interface, but the cost of
that simplicity is that it does not seem well suited to sending large
amounts of packed binary data. It is great for crafting test data by
hand, but more complicated for userland programs, which now need to
convert binary data into ASCII before sending it.

I think though that as long as the cost of uploading the whole page
table from start to finish isn't too great, a dm message based mechanism
would be acceptable.

> > > > 2) vmalloc and TLB performance
<snip>
> > The table would also have to be reallocated on LUN resize or if the data
> > is moved across a different number of devices (provided the change
> > is such that it causes the number of bits-per-page to change), such
> > as if a 2-device setup represented by 1 bit per page changed to a
> > 3-device setup represented by 2 bits per page.
> >
> > Granted these are not frequent operations, but we need to continue to
> > properly handle these cases.
> >
> > We also need to keep the multiple device scenario in mind (perhaps 100s of
> > targets in use or being created simultaneously).
>
> For these operations (resizing the device or changing the number of
> underlying devices), you can load a new table, suspend the device and
> resume the device. It will switch to the new table and destroy the old
> one.
>
> You have to reload the table anyway if you change device size, so there is
> no need to include code to change table size in the target driver.

Good points.

> > > > And, on all systems, use of space in the vmalloc() range
> > > > increases pressure on the translation lookaside buffer (TLB), reducing the
> > > > performance of the system."
> > > >
> > > > The page table lookup is in the I/O path, so performance is an important
> > > > consideration. Do you have any performance comparisons between our
> > > > existing 2-level lookup of kmalloc'd memory versus a single vmalloc'd

Besides the performance consideration of uploading a large page table to
the device, the actual I/O performance is another important
consideration we have not yet addressed.

I think a side-by-side comparison of I/O performance would be useful to
see, comparing the vmalloc single table versus the kmalloc 2-step
lookup. We are curious whether there is any impact from doing lookups
across a large vmalloc'd area on multiple disks simultaneously.

> > > > Also, in the example above with 1572864 page table entries, assuming 2
> > > > bits per entry requires a table of 384KB. Would this be a problem for the
> > > > vmalloc system, especially on 32-bit systems, if there are multiple
> > > > devices of similarly large size in use at the same time?
> > >
> > > 384KB is not a problem, the whole vmalloc space is 128MB.
> >
> > This means we could allow ~375 similarly-sized devices in the system,
> > assuming no other kernel objects are consuming any vmalloc space. This
> > could be okay, provided our performance considerations are also
> > addressed, but allowing sparse allocation may be a good enough reason
> > to use a 2-level allocation scheme.
> >
> > > > It can also be desirable to allow sparsely-populated page tables, when it
> > > > is known that large chunks are not needed or deemed (by external logic)
> > > > not important enough to consume kernel memory. A 2-level kmalloc'd memory
> > > > scheme can save memory in sparsely-allocated situations.
> >
> > This ability to do sparse allocations may be important depending on what
> > else is going on in the kernel and using vmalloc space.
>
> It may be possible to use a radix tree and do sparse allocations, but given
> the current usage (tables with a million entries, each entry having a few
> bits), it doesn't seem to be a problem now.

I suppose it depends on how general we want this driver to be. The
number of pages could be considerably larger if the underlying volumes
are larger or if the page sizes were considerably smaller. For our
particular use of this device, and a reasonable look into the
not-too-distant future, I believe we would be happy if it works well
with the tens-millions-of-pages scope. We are less concerned about the
case of hundreds-of-millions-of-pages or larger, at least for now.
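The sparsely-allocated 2-level scheme under discussion could be sketched as
follows. This is an illustrative userland sketch only: all names here are
hypothetical, and a kernel version would use kmalloc and need locking. The
idea is a directory of chunk pointers, where a chunk is allocated only once
some page in its range is actually mapped:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a sparse 2-level page table: a directory of
 * chunk pointers; a chunk is allocated on first write to its range. */
#define CHUNK_PAGES 4096
#define NO_DEVICE   0xff

struct sparse_table {
    unsigned long n_chunks;
    unsigned char **chunks;     /* chunks[i] == NULL: range unpopulated */
};

static int sparse_set(struct sparse_table *t, unsigned long page,
                      unsigned char device)
{
    unsigned long ci = page / CHUNK_PAGES;

    if (ci >= t->n_chunks)
        return -1;
    if (!t->chunks[ci]) {
        /* first touch of this range: allocate and mark unmapped */
        t->chunks[ci] = malloc(CHUNK_PAGES);
        if (!t->chunks[ci])
            return -1;
        memset(t->chunks[ci], NO_DEVICE, CHUNK_PAGES);
    }
    t->chunks[ci][page % CHUNK_PAGES] = device;
    return 0;
}

static unsigned char sparse_get(const struct sparse_table *t,
                                unsigned long page)
{
    unsigned long ci = page / CHUNK_PAGES;
    const unsigned char *c;

    if (ci >= t->n_chunks)
        return NO_DEVICE;
    c = t->chunks[ci];
    return c ? c[page % CHUNK_PAGES] : NO_DEVICE;
}
```

An unpopulated range costs only one NULL pointer, which is what makes the
sparse case cheap relative to a single flat vmalloc'd table.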

I'm also not that familiar with which other devices use vmalloc space in
the kernel. With a limited resource like this we must make sure we can
properly contend with other consumers of the space.

So in conclusion, I think we're converging on something that we're all
going to be happy with. We just need to ensure that the performance of
the proposed changes is acceptable compared to our existing code.

To that end I will be devoting some time next week to getting your
driver working with our userland to test page table uploading and actual
IO performance. Would you mind taking some time to try the latest 'v4'
version of this driver and doing an independent comparison?

v4 driver code is here:
http://www.redhat.com/archives/dm-devel/2012-August/msg00299.html
v4 userland example code is here:
http://www.redhat.com/archives/dm-devel/2012-August/msg00300.html

--
Jim Ramsay

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-30-2012, 12:51 AM
Mikulas Patocka
 
reworked dm-switch target

On Fri, 24 Aug 2012, Jim Ramsay wrote:

> I'm also not that familiar with what other devices use vmalloc space in
> the kernel - With a limited resource like this we must make sure we can
> properly contend with other consumers of the space.
>
> So in conclusion, I think we're converging on something that we're all
> going to be happy with. We just need to ensure that the performance of
> the proposed changes is acceptable compared to our existing code.
>
> To that end I will be devoting some time next week to getting your
> driver working with our userland to test page table uploading and actual
> IO performance. Would you mind taking some time to try the latest 'v4'
> version of this driver and doing an independent comparison?
>
> v4 driver code is here:
> http://www.redhat.com/archives/dm-devel/2012-August/msg00299.html
> v4 userland example code is here:
> http://www.redhat.com/archives/dm-devel/2012-August/msg00300.html
>
> --
> Jim Ramsay

Hi

I tested both your and my dm-switch implementations. I created three 1GB
ramdisk devices and filled them with data (so that allocate-on-demand in
the ramdisk driver doesn't influence the benchmark). Then I placed a
dm-switch device on these ramdisks, with a page size of 3 sectors and a
page table with the repetitive pattern 012012012...

I ran an fio command performing 8 parallel I/O threads (on two quad-core
Opterons), each reading or writing 512-bytes with direct I/O: "fio
--rw=randrw --size=1G --bs=512 --filename=/dev/mapper/switch2 --direct=1
--name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6
--name=job7 --name=job8"

Basically, the worst performance brake is the spinlock in your code,
otherwise there is not much difference. vmalloc doesn't slow it down
(actually it was slightly faster with vmalloc than with kmalloc).

Your original code
15.15s, stdev 0.05

Your code (with spinlock removed):
14.33s, stdev 0.17

Your code (with spinlock and RCU removed):
14.38s, stdev 0.16

My code (with kmalloc):
14.45s, stdev 0.17

My code (with vmalloc):
14.31s, stdev 0.11

Mikulas

 
Old 09-07-2012, 07:58 PM
Jim Ramsay
 
reworked dm-switch target

On Wed, Aug 29, 2012 at 08:51:06PM -0400, Mikulas Patocka wrote:
> Basically, the worst performance brake is the spinlock in your code,
> otherwise there is not much difference. vmalloc doesn't slow it down
> (actually it was slightly faster with vmalloc than with kmalloc).

Thanks very much for your performance testing. I've completed my
testing too, and my results agree with yours. When running against our
actual iSCSI storage solution, the IO performance for this vmalloc-based
code is equivalent to the 2-step kmalloc code.

I also did a side-by-side comparison of uploading 1048576 page table
entries. The netlink code averaged out to 0.0026s, and the DM message
code averaged out to 0.042s. As expected, the message code is slower,
but well within reasonable performance expectations.

I am attaching below an updated version of your 'dm-switch.c' file,
based on your latest post at
http://www.redhat.com/archives/dm-devel/2012-August/msg00224.html, which
makes the following changes:

1. Support for FLUSH and DISCARD operations by implementing
target_type.iterate_devices and handling (bio->bi_rw & REQ_FLUSH) in
switch_map. Sends DISCARD to one path, FLUSH to each path.

2. Send IOCTLs to the device that owns sector 0, instead of
pctx->dev_list[0]

3. Copyright notice update in header, plus adding myself to
MODULE_AUTHOR

(Question: Would you prefer these as a patch series against your dm-switch.c
instead?)

------ dm-switch.c ------
/*
* Copyright (c) 2010-2012 by Dell Inc. All rights reserved.
*
* This file is released under the GPL.
*
* Description:
*
* file: dm-switch.c
* authors: Kevin_OKelley@dell.com
* Jim_Ramsay@dell.com
* Narendran_Ganapathy@dell.com
* mpatocka@redhat.com
*
* This file implements a "switch" target which efficiently implements a
* mapping of IOs to underlying block devices in scenarios where there are:
* (1) a large number of address regions
* (2) a fixed size equal across all address regions
* (3) no pattern that allows for a compact description with something like
* the dm-stripe target.
*/

#include <linux/module.h>
#include <linux/init.h>
#include <linux/device-mapper.h>
#include <linux/vmalloc.h>

#define DM_MSG_PREFIX "switch"

/*
* Switch device context block: A new one is created for each dm device.
* Contains an array of devices from which we have taken references.
*/
struct switch_dev {
struct dm_dev *dmdev;
sector_t start;
};

typedef unsigned long pt_entry;

/* Switch context header */
struct switch_ctx {
unsigned dev_count; /* Number of devices */
unsigned page_size; /* Page size in 512B sectors */
unsigned long n_pages; /* Number of pages */
signed char page_size_bits; /* log2 of page_size or -1 */

unsigned char pte_size; /* Page table entry size in bits */
unsigned char pte_fields; /* Number of entries per pt_entry */
signed char pte_fields_bits; /* log2 of pte_fields or -1 */
pt_entry *page_table; /* Page table */

/* Array of dm devices to switch between */
struct switch_dev dev_list[0];
};

static inline void switch_get_position(struct switch_ctx *pctx,
unsigned long page,
unsigned long *index,
unsigned *bit)

{
if (pctx->pte_fields_bits >= 0) {
*index = page >> pctx->pte_fields_bits;
*bit = page & (pctx->pte_fields - 1);
} else {
*index = page / pctx->pte_fields;
*bit = page % pctx->pte_fields;
}
*bit *= pctx->pte_size;

}

static inline unsigned switch_get_deviceidx(struct switch_ctx *pctx,
sector_t sector)
{
unsigned long index;
unsigned bit, idev;
sector_t p;

p = sector;
if (pctx->page_size_bits >= 0)
p >>= pctx->page_size_bits;
else
sector_div(p, pctx->page_size);

switch_get_position(pctx, p, &index, &bit);
idev = (ACCESS_ONCE(pctx->page_table[index]) >> bit) &
((1 << pctx->pte_size) - 1);

/* This can only happen if the processor uses non-atomic stores. */
if (unlikely(idev >= pctx->dev_count))
idev = 0;

return idev;
}

static void switch_page_table_write(struct switch_ctx *pctx, unsigned long page,
unsigned value)
{
unsigned long index;
unsigned bit;
pt_entry pte;

switch_get_position(pctx, page, &index, &bit);

pte = pctx->page_table[index];
pte &= ~((((pt_entry)1 << pctx->pte_size) - 1) << bit);
pte |= (pt_entry)value << bit;
pctx->page_table[index] = pte;
}

/*
* Constructor: Called each time a dmsetup command creates a dm device. The
* target parameter will already have the table, type, begin and len fields
* filled in. Arguments are in pairs: <dev_path> <offset>. Therefore, we get
* multiple constructor calls, but we will need to build a list of switch_ctx
* blocks so that the page table information gets matched to the correct
* device.
*/
static int switch_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
unsigned a;
int n;
int r;
unsigned dev_count;
struct switch_ctx *pctx;
sector_t dev_size;
unsigned long e;

if (argc < 4) {
ti->error = "Insufficient arguments";
r = -EINVAL;
goto error;
}
if (kstrtouint(argv[0], 10, &dev_count) ||
!dev_count ||
dev_count > (KMALLOC_MAX_SIZE - sizeof(struct switch_ctx)) / sizeof(struct switch_dev)) {
ti->error = "Invalid device count";
r = -EINVAL;
goto error;
}
if (dev_count != (argc - 2) / 2) {
ti->error = "Invalid argument count";
r = -EINVAL;
goto error;
}
pctx = kmalloc(sizeof(struct switch_ctx) + (dev_count * sizeof(struct switch_dev)),
GFP_KERNEL);
if (!pctx) {
ti->error = "Cannot allocate redirect context";
r = -ENOMEM;
goto error;
}
pctx->dev_count = dev_count;
if (kstrtouint(argv[1], 10, &pctx->page_size) ||
!pctx->page_size) {
ti->error = "Invalid page size";
r = -EINVAL;
goto error_kfree;
}

if (!(pctx->page_size & (pctx->page_size - 1)))
pctx->page_size_bits = __ffs(pctx->page_size);
else
pctx->page_size_bits = -1;

pctx->pte_size = 1;
while (pctx->pte_size < sizeof(pt_entry) * 8 &&
(pt_entry)1 << pctx->pte_size < pctx->dev_count)
pctx->pte_size++;

pctx->pte_fields = (sizeof(pt_entry) * 8) / pctx->pte_size;
if (!(pctx->pte_fields & (pctx->pte_fields - 1)))
pctx->pte_fields_bits = __ffs(pctx->pte_fields);
else
pctx->pte_fields_bits = -1;

dev_size = ti->len;
if (sector_div(dev_size, pctx->page_size))
dev_size++;

pctx->n_pages = dev_size;
if (pctx->n_pages != dev_size || pctx->n_pages >= ULONG_MAX) {
ti->error = "Too long page table";
r = -EINVAL;
goto error_kfree;
}

if (sector_div(dev_size, pctx->pte_fields))
dev_size++;

if (dev_size > ULONG_MAX / sizeof(pt_entry)) {
ti->error = "Too long page table";
r = -EINVAL;
goto error_kfree;
}

r = dm_set_target_max_io_len(ti, pctx->page_size);
if (r)
goto error_kfree;

pctx->page_table = vmalloc(dev_size * sizeof(pt_entry));
if (!pctx->page_table) {
ti->error = "Cannot allocate page table";
r = -ENOMEM;
goto error_kfree;
}

a = 0;
for (e = 0; e < pctx->n_pages; e++) {
switch_page_table_write(pctx, e, a);
a++;
if (a >= pctx->dev_count)
a = 0;
}

/*
* Check each device beneath the target to ensure that the limits are
* consistent.
*/
for (n = 0, a = 2; n < pctx->dev_count; n++, a += 2) {
struct dm_dev *dm;
sector_t dev_size;
unsigned long long start;

if (kstrtoull(argv[a + 1], 10, &start) ||
start != (sector_t)start) {
ti->error = "Invalid device starting offset";
r = -EINVAL;
n--;
goto error_release_n;
}
r = dm_get_device
(ti, argv[a], dm_table_get_mode(ti->table), &dm);
if (r) {
ti->error = "Device lookup failed";
n--;
goto error_release_n;
}
pctx->dev_list[n].dmdev = dm;
pctx->dev_list[n].start = start;

dev_size = i_size_read(dm->bdev->bd_inode) >> SECTOR_SHIFT;

if (start + ti->len > dev_size) {
ti->error = "Device is too small";
r = -EINVAL;
goto error_release_n;
}
}

/* For UNMAP, sending the request down any path is sufficient */
ti->num_discard_requests = 1;
/* For FLUSH, we should flush each path */
ti->num_flush_requests = pctx->dev_count;

ti->private = pctx;

return 0;

error_release_n: /* De-reference all devices */
for (; n >= 0; n--)
dm_put_device(ti, pctx->dev_list[n].dmdev);

vfree(pctx->page_table);
error_kfree:
kfree(pctx);

error:
return r;
}

/*
* Destructor: Don't free the dm_target, just the ti->private data (if any).
*/
static void switch_dtr(struct dm_target *ti)
{
int n;
struct switch_ctx *pctx = ti->private;

for (n = 0; n < pctx->dev_count; n++)
dm_put_device(ti, pctx->dev_list[n].dmdev);

vfree(pctx->page_table);
kfree(pctx);
}

static int switch_map(struct dm_target *ti, struct bio *bio,
union map_info *map_context)
{
struct switch_ctx *pctx = ti->private;

sector_t offset = bio->bi_sector - ti->begin;
unsigned idev;

if (bio->bi_rw & REQ_FLUSH) {
int request_nr = map_context->target_request_nr;
BUG_ON(request_nr >= pctx->dev_count);
bio->bi_bdev = pctx->dev_list[request_nr].dmdev->bdev;
return DM_MAPIO_REMAPPED;
}

idev = switch_get_deviceidx(pctx, offset);

bio->bi_bdev = pctx->dev_list[idev].dmdev->bdev;
bio->bi_sector = pctx->dev_list[idev].start + offset;

return DM_MAPIO_REMAPPED;
}

/*
* We need to parse hex numbers as fast as possible.
* Message is used to load the whole table.
*
* This table-based hex parser improves performance: it reduces the time
* needed to load 1000000 entries compared to the condition-based parser.
* table-based parser condition-based parser
* PA-RISC 0.29s 0.31s
* Opteron 0.0495s 0.0498s
*/

static const unsigned char hex_table[256] = {
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
0,1,2,3,4,5,6,7,8,9,255,255,255,255,255,255,
255,10,11,12,13,14,15,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,10,11,12,13,14,15,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255
};

static inline void parse_hex(const char *string, sector_t *result, const char **end)
{
unsigned char d;
sector_t r = 0;
#if 1
while ((d = hex_table[(unsigned char)*string]) < 16) {
r = (r << 4) | d;
string++;
}
#else
while (1) {
d = *string;
if (d >= '0' && d <= '9')
d -= '0';
else if (d >= 'A' && d <= 'F')
d -= 'A' - 10;
else if (d >= 'a' && d <= 'f')
d -= 'a' - 10;
else
break;
r = (r << 4) | d;
string++;
}
#endif
*end = string;
*result = r;
}

static int switch_message(struct dm_target *ti, unsigned argc, char **argv)
{
static DEFINE_MUTEX(message_mutex);

struct switch_ctx *pctx = ti->private;
int r;

mutex_lock(&message_mutex);

if (!argc) {
goto invalid_message;
} else if (!strcasecmp(argv[0], "set-table")) {
unsigned i;
sector_t table_index = 0;
for (i = 1; i < argc; i++) {
sector_t device;
const char *string = argv[i];
if (*string == ':')
table_index++;
else {
parse_hex(string, &table_index, &string);
if (unlikely(*string != ':')) {
invalid_table:
DMWARN("invalid set-table argument");
r = -EINVAL;
goto ret;
}
}
string++;
if (unlikely(!*string))
goto invalid_table;
parse_hex(string, &device, &string);
if (unlikely(*string))
goto invalid_table;
if (unlikely(table_index >= pctx->n_pages)) {
DMWARN("invalid set-table page");
r = -EINVAL;
goto ret;
}
if (unlikely(device >= pctx->dev_count)) {
DMWARN("invalid set-table device");
r = -EINVAL;
goto ret;
}
switch_page_table_write(pctx, table_index, device);
}
r = 0;
} else {
invalid_message:
DMWARN("unrecognised message received.");
r = -EINVAL;
}
ret:
mutex_unlock(&message_mutex);
return r;
}

static int switch_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
struct switch_ctx *pctx = ti->private;
unsigned sz = 0;
int n;

result[0] = '\0';
switch (type) {
case STATUSTYPE_INFO:
result[0] = 0;
break;

case STATUSTYPE_TABLE:
DMEMIT("%u %u", pctx->dev_count, pctx->page_size);
for (n = 0; n < pctx->dev_count; n++) {
DMEMIT(" %s %llu", pctx->dev_list[n].dmdev->name,
(unsigned long long)pctx->dev_list[n].start);
}
break;

default:
return 0;
}
return 0;
}

/*
* Switch ioctl:
*
* Passthrough all ioctls to the path for sector 0
*/
static int switch_ioctl(struct dm_target *ti, unsigned cmd,
unsigned long arg)
{
struct switch_ctx *pctx = ti->private;
struct block_device *bdev;
fmode_t mode;
unsigned idev;

idev = switch_get_deviceidx(pctx, 0);

bdev = pctx->dev_list[idev].dmdev->bdev;
mode = pctx->dev_list[idev].dmdev->mode;

return __blkdev_driver_ioctl(bdev, mode, cmd, arg);
}

static int switch_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
struct switch_ctx *pctx = (struct switch_ctx *)ti->private;
int n, ret = 0;

for (n = 0; n < pctx->dev_count; n++) {
ret = fn(ti, pctx->dev_list[n].dmdev, ti->begin, ti->len, data);
if (ret)
goto out;
}

out:
return ret;
}

static struct target_type switch_target = {
.name = "switch",
.version = {1, 0, 0},
.module = THIS_MODULE,
.ctr = switch_ctr,
.dtr = switch_dtr,
.map = switch_map,
.message = switch_message,
.status = switch_status,
.ioctl = switch_ioctl,
.iterate_devices = switch_iterate_devices,
};

int __init dm_switch_init(void)
{
int r;

r = dm_register_target(&switch_target);
if (r) {
DMERR("dm_register_target() failed %d", r);
return r;
}

return 0;
}

void dm_switch_exit(void)
{
dm_unregister_target(&switch_target);
}

module_init(dm_switch_init);
module_exit(dm_switch_exit);

MODULE_DESCRIPTION(DM_NAME " fixed-size address-region-mapping throughput-oriented path selector");
MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley@dell.com>");
MODULE_AUTHOR("Jim Ramsay <Jim_Ramsay@dell.com>");
MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
MODULE_LICENSE("GPL");
-------------------------

--
Jim Ramsay

 
Old 09-08-2012, 04:35 PM
Mikulas Patocka
 
reworked dm-switch target

Hi

On Fri, 7 Sep 2012, Jim Ramsay wrote:

> On Wed, Aug 29, 2012 at 08:51:06PM -0400, Mikulas Patocka wrote:
> > Basically, the worst performance brake is the spinlock in your code,
> > otherwise there is not much difference. vmalloc doesn't slow it down
> > (actually it was slightly faster with vmalloc than with kmalloc).
>
> Thanks very much for your performance testing. I've completed my
> testing too, and my results agree with yours. When running against our
> actual iSCSI storage solution, the IO performance for this vmalloc-based
> code is equivalent to the 2-step kmalloc code.
>
> I also did a side-by-side comparison of uploading 1048576 page table
> entries. The netlink code averaged out to 0.0026s, and the DM message
> code averaged out to 0.042s. As expected, the message code is slower,
> but well within reasonable performance expectations.
>
> I am attaching below an updated version of your 'dm-switch.c' file,
> based on your latest post in
> http://www.redhat.com/archives/dm-devel/2012-August/msg00224.html that
> makes the following changes:
>
> 1. Support for FLUSH and DISCARD operations by implementing
> target_type.iterate_devices and handling (bio->bi_rw & REQ_FLUSH) in
> switch_map. Sends DISCARD to one path, FLUSH to each path.
>
> 2. Send IOCTLs to the device who owns sector 0, instead of
> pctx->dev_list[0]
>
> 3. Copyright notice update in header, plus adding myself to
> MODULE_AUTHOR
>
> (Question: Would you prefer these as a patch series against your dm-switch.c
> instead?)

You don't have to send it as a patch, I can trivially create the patch on
my own.

Your changes are ok, they could be included in the kernel.

Regarding flushes ... could you please explain how write-back caching
works in your design?

Is there some write-back cache in the host adapter?

Is there some write-back cache in the individual storage nodes? (i.e. does
the node signal end of I/O before it finishes writing data to the disk)

How does the node deal with write-back cache in the disk? Does the node
turn on write back cache in the disk? Does the node send SYNCHRONIZE CACHE
commands to the disk automatically? Or does it send SYNCHRONIZE CACHE
commands only when flush is received over the network?

I'd like to know these things to better understand the behavior of the
flush request.


Another thing - please resend your code with "Signed-off-by". Read the
meaning of "Signed-off-by" in Documentation/SubmittingPatches, agree to
the terms and append "Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>" to
the code. It is a legal requirement, so that you certify that the code is
under the open source license and that you have the right to distribute
the code.

Mikulas

 
Old 09-11-2012, 01:07 PM
Jim Ramsay
 
reworked dm-switch target

On Sat, Sep 08, 2012 at 12:35:22PM -0400, Mikulas Patocka wrote:
> Your changes are ok, it could be included in the kernel.
>
> Regarding flushes ... could you please explain how write-back caching
> works in your design?

We are not actually using write-back caching in our design at all, but
assumed that for a general-purpose DM device like dm-switch, some
underlying storage may rely on REQ_FLUSH to ensure proper data
synchronization, so the switch should do the right thing when required.
The code is modeled on dm-stripe.

> Is there some write-back cache in the host adapter?

No, there is not at this time.

> Is there some write-back cache in the individual storage nodes? (i.e. does
> the node signal end of I/O before it finishes writing data to the disk)

No, there is not at this time. The iSCSI-backed SD devices we are
ultimately using do not advertise any write-back cache ('sdparm
--get=WCE' returns 0).

> Another thing - please resend your code with "Signed-off-by". Read the
> meaning of "Signed-off-by" in Documentation/SubmittingPatches, agree to
> the terms and append "Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>" to
> the code. It is a legal requirement, so that you certify that the code is
> under the open source license and that you have the right to distribute
> the code.

Will do shortly. Thanks!

--
Jim Ramsay

 
Old 09-11-2012, 01:14 PM
Jim Ramsay
 
reworked dm-switch target

On Sat, Sep 08, 2012 at 12:35:22PM -0400, Mikulas Patocka wrote:
> Another thing - please resend your code with "Signed-off-by". Read the
> meaning of "Signed-off-by" in Documentation/SubmittingPatches, agree to
> the terms and append "Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>" to
> the code. It is a legal requirement, so that you certify that the code is
> under the open source license and that you have the right to distribute
> the code.

I am attaching below an updated version of your 'dm-switch.c' file,
based on your latest post at
http://www.redhat.com/archives/dm-devel/2012-August/msg00224.html, which
makes the following changes:

1. Support for FLUSH and DISCARD operations by implementing
target_type.iterate_devices and handling (bio->bi_rw & REQ_FLUSH) in
switch_map. Sends DISCARD to one path, FLUSH to each path.

2. Send IOCTLs to the device that owns sector 0, instead of
pctx->dev_list[0]

3. Copyright notice update in header, plus adding myself to
MODULE_AUTHOR

Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>

------ dm-switch.c ------
/*
* Copyright (c) 2010-2012 by Dell Inc. All rights reserved.
*
* This file is released under the GPL.
*
* Description:
*
* file: dm-switch.c
* authors: Kevin_OKelley@dell.com
* Jim_Ramsay@dell.com
* Narendran_Ganapathy@dell.com
* mpatocka@redhat.com
*
* This file implements a "switch" target which efficiently implements a
* mapping of IOs to underlying block devices in scenarios where there are:
* (1) a large number of address regions
* (2) a fixed size equal across all address regions
* (3) no pattern that allows for a compact description with something like
* the dm-stripe target.
*/

#include <linux/module.h>
#include <linux/init.h>
#include <linux/device-mapper.h>
#include <linux/vmalloc.h>

#define DM_MSG_PREFIX "switch"

/*
* Switch device context block: A new one is created for each dm device.
* Contains an array of devices from which we have taken references.
*/
struct switch_dev {
struct dm_dev *dmdev;
sector_t start;
};

typedef unsigned long pt_entry;

/* Switch context header */
struct switch_ctx {
unsigned dev_count; /* Number of devices */
unsigned page_size; /* Page size in 512B sectors */
unsigned long n_pages; /* Number of pages */
signed char page_size_bits; /* log2 of page_size or -1 */

unsigned char pte_size; /* Page table entry size in bits */
unsigned char pte_fields; /* Number of entries per pt_entry */
signed char pte_fields_bits; /* log2 of pte_fields or -1 */
pt_entry *page_table; /* Page table */

/* Array of dm devices to switch between */
struct switch_dev dev_list[0];
};

static inline void switch_get_position(struct switch_ctx *pctx,
unsigned long page,
unsigned long *index,
unsigned *bit)
{
if (pctx->pte_fields_bits >= 0) {
*index = page >> pctx->pte_fields_bits;
*bit = page & (pctx->pte_fields - 1);
} else {
*index = page / pctx->pte_fields;
*bit = page % pctx->pte_fields;
}
*bit *= pctx->pte_size;
}

static inline unsigned switch_get_deviceidx(struct switch_ctx *pctx,
sector_t sector)
{
unsigned long index;
unsigned bit, idev;
sector_t p;

p = sector;
if (pctx->page_size_bits >= 0)
p >>= pctx->page_size_bits;
else
sector_div(p, pctx->page_size);

switch_get_position(pctx, p, &index, &bit);
idev = (ACCESS_ONCE(pctx->page_table[index]) >> bit) &
((1 << pctx->pte_size) - 1);

/* This can only happen if the processor uses non-atomic stores. */
if (unlikely(idev >= pctx->dev_count))
idev = 0;

return idev;
}

static void switch_page_table_write(struct switch_ctx *pctx, unsigned long page,
unsigned value)
{
unsigned long index;
unsigned bit;
pt_entry pte;

switch_get_position(pctx, page, &index, &bit);

pte = pctx->page_table[index];
pte &= ~((((pt_entry)1 << pctx->pte_size) - 1) << bit);
pte |= (pt_entry)value << bit;
pctx->page_table[index] = pte;
}

/*
* Constructor: Called each time a dmsetup command creates a dm device. The
* target parameter will already have the table, type, begin and len fields
* filled in. Arguments are in pairs: <dev_path> <offset>. Therefore, we get
* multiple constructor calls, but we will need to build a list of switch_ctx
* blocks so that the page table information gets matched to the correct
* device.
*/
static int switch_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
unsigned a;
int n;
int r;
unsigned dev_count;
struct switch_ctx *pctx;
sector_t dev_size;
unsigned long e;

if (argc < 4) {
ti->error = "Insufficient arguments";
r = -EINVAL;
goto error;
}
if (kstrtouint(argv[0], 10, &dev_count) ||
!dev_count ||
dev_count > (KMALLOC_MAX_SIZE - sizeof(struct switch_ctx)) / sizeof(struct switch_dev)) {
ti->error = "Invalid device count";
r = -EINVAL;
goto error;
}
if (dev_count != (argc - 2) / 2) {
ti->error = "Invalid argument count";
r = -EINVAL;
goto error;
}
pctx = kmalloc(sizeof(struct switch_ctx) + (dev_count * sizeof(struct switch_dev)),
GFP_KERNEL);
if (!pctx) {
ti->error = "Cannot allocate redirect context";
r = -ENOMEM;
goto error;
}
pctx->dev_count = dev_count;
if (kstrtouint(argv[1], 10, &pctx->page_size) ||
!pctx->page_size) {
ti->error = "Invalid page size";
r = -EINVAL;
goto error_kfree;
}

if (!(pctx->page_size & (pctx->page_size - 1)))
pctx->page_size_bits = __ffs(pctx->page_size);
else
pctx->page_size_bits = -1;

pctx->pte_size = 1;
while (pctx->pte_size < sizeof(pt_entry) * 8 &&
(pt_entry)1 << pctx->pte_size < pctx->dev_count)
pctx->pte_size++;

pctx->pte_fields = (sizeof(pt_entry) * 8) / pctx->pte_size;
if (!(pctx->pte_fields & (pctx->pte_fields - 1)))
pctx->pte_fields_bits = __ffs(pctx->pte_fields);
else
pctx->pte_fields_bits = -1;

dev_size = ti->len;
if (sector_div(dev_size, pctx->page_size))
dev_size++;

pctx->n_pages = dev_size;
if (pctx->n_pages != dev_size || pctx->n_pages >= ULONG_MAX) {
ti->error = "Too long page table";
r = -EINVAL;
goto error_kfree;
}

if (sector_div(dev_size, pctx->pte_fields))
dev_size++;

if (dev_size > ULONG_MAX / sizeof(pt_entry)) {
ti->error = "Too long page table";
r = -EINVAL;
goto error_kfree;
}

r = dm_set_target_max_io_len(ti, pctx->page_size);
if (r)
goto error_kfree;

pctx->page_table = vmalloc(dev_size * sizeof(pt_entry));
if (!pctx->page_table) {
ti->error = "Cannot allocate page table";
r = -ENOMEM;
goto error_kfree;
}

a = 0;
for (e = 0; e < pctx->n_pages; e++) {
switch_page_table_write(pctx, e, a);
a++;
if (a >= pctx->dev_count)
a = 0;
}

/*
* Check each device beneath the target to ensure that the limits are
* consistent.
*/
for (n = 0, a = 2; n < pctx->dev_count; n++, a += 2) {
struct dm_dev *dm;
sector_t dev_size;
unsigned long long start;

if (kstrtoull(argv[a + 1], 10, &start) ||
start != (sector_t)start) {
ti->error = "Invalid device starting offset";
r = -EINVAL;
n--;
goto error_release_n;
}
r = dm_get_device
(ti, argv[a], dm_table_get_mode(ti->table), &dm);
if (r) {
ti->error = "Device lookup failed";
n--;
goto error_release_n;
}
pctx->dev_list[n].dmdev = dm;
pctx->dev_list[n].start = start;

dev_size = i_size_read(dm->bdev->bd_inode) >> SECTOR_SHIFT;

if (ti->len > start + dev_size) {
ti->error = "Device is too small";
r = -EINVAL;
goto error_release_n;
}
}

/* For UNMAP, sending the request down any path is sufficient */
ti->num_discard_requests = 1;
/* For FLUSH, we should flush each path */
ti->num_flush_requests = pctx->dev_count;

ti->private = pctx;

return 0;

error_release_n: /* De-reference all devices */
for (; n >= 0; n--)
dm_put_device(ti, pctx->dev_list[n].dmdev);

vfree(pctx->page_table);
error_kfree:
kfree(pctx);

error:
return r;
}

/*
* Destructor: Don't free the dm_target, just the ti->private data (if any).
*/
static void switch_dtr(struct dm_target *ti)
{
int n;
struct switch_ctx *pctx = ti->private;

for (n = 0; n < pctx->dev_count; n++)
dm_put_device(ti, pctx->dev_list[n].dmdev);

vfree(pctx->page_table);
kfree(pctx);
}

static int switch_map(struct dm_target *ti, struct bio *bio,
union map_info *map_context)
{
struct switch_ctx *pctx = ti->private;

sector_t offset = bio->bi_sector - ti->begin;
unsigned idev;

if (bio->bi_rw & REQ_FLUSH) {
int request_nr = map_context->target_request_nr;
BUG_ON(request_nr >= pctx->dev_count);
bio->bi_bdev = pctx->dev_list[request_nr].dmdev->bdev;
return DM_MAPIO_REMAPPED;
}

idev = switch_get_deviceidx(pctx, offset);

bio->bi_bdev = pctx->dev_list[idev].dmdev->bdev;
bio->bi_sector = pctx->dev_list[idev].start + offset;

return DM_MAPIO_REMAPPED;
}

/*
* We need to parse hex numbers as fast as possible, because a message is
* used to load the whole table.
*
* This table-based hex parser improves performance. Time to load
* 1,000,000 entries, table-based vs. condition-based parser:
* table-based parser condition-based parser
* PA-RISC 0.29s 0.31s
* Opteron 0.0495s 0.0498s
*/

static const unsigned char hex_table[256] = {
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
0,1,2,3,4,5,6,7,8,9,255,255,255,255,255,255,
255,10,11,12,13,14,15,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,10,11,12,13,14,15,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,
255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255
};

static inline void parse_hex(const char *string, sector_t *result, const char **end)
{
unsigned char d;
sector_t r = 0;
#if 1
while ((d = hex_table[(unsigned char)*string]) < 16) {
r = (r << 4) | d;
string++;
}
#else
while (1) {
d = *string;
if (d >= '0' && d <= '9')
d -= '0';
else if (d >= 'A' && d <= 'F')
d -= 'A' - 10;
else if (d >= 'a' && d <= 'f')
d -= 'a' - 10;
else
break;
r = (r << 4) | d;
string++;
}
#endif
*end = string;
*result = r;
}

static int switch_message(struct dm_target *ti, unsigned argc, char **argv)
{
static DEFINE_MUTEX(message_mutex);

struct switch_ctx *pctx = ti->private;
int r;

mutex_lock(&message_mutex);

if (!argc) {
goto invalid_message;
} else if (!strcasecmp(argv[0], "set-table")) {
unsigned i;
sector_t table_index = 0;
for (i = 1; i < argc; i++) {
sector_t device;
const char *string = argv[i];
if (*string == ':')
table_index++;
else {
parse_hex(string, &table_index, &string);
if (unlikely(*string != ':')) {
invalid_table:
DMWARN("invalid set-table argument");
r = -EINVAL;
goto ret;
}
}
string++;
if (unlikely(!*string))
goto invalid_table;
parse_hex(string, &device, &string);
if (unlikely(*string))
goto invalid_table;
if (unlikely(table_index >= pctx->n_pages)) {
DMWARN("invalid set-table page");
r = -EINVAL;
goto ret;
}
if (unlikely(device >= pctx->dev_count)) {
DMWARN("invalid set-table device");
r = -EINVAL;
goto ret;
}
switch_page_table_write(pctx, table_index, device);
}
r = 0;
} else {
invalid_message:
DMWARN("unrecognised message received.");
r = -EINVAL;
}
ret:
mutex_unlock(&message_mutex);
return r;
}

static int switch_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
struct switch_ctx *pctx = ti->private;
unsigned sz = 0;
int n;

result[0] = '\0';
switch (type) {
case STATUSTYPE_INFO:
result[0] = 0;
break;

case STATUSTYPE_TABLE:
DMEMIT("%u %u", pctx->dev_count, pctx->page_size);
for (n = 0; n < pctx->dev_count; n++) {
DMEMIT(" %s %llu", pctx->dev_list[n].dmdev->name,
(unsigned long long)pctx->dev_list[n].start);
}
break;

default:
return 0;
}
return 0;
}

/*
* Switch ioctl:
*
* Passthrough all ioctls to the path for sector 0
*/
static int switch_ioctl(struct dm_target *ti, unsigned cmd,
unsigned long arg)
{
struct switch_ctx *pctx = ti->private;
struct block_device *bdev;
fmode_t mode;
unsigned idev;

idev = switch_get_deviceidx(pctx, 0);

bdev = pctx->dev_list[idev].dmdev->bdev;
mode = pctx->dev_list[idev].dmdev->mode;

return __blkdev_driver_ioctl(bdev, mode, cmd, arg);
}

static int switch_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
struct switch_ctx *pctx = (struct switch_ctx *)ti->private;
int n, ret = 0;

for (n = 0; n < pctx->dev_count; n++) {
ret = fn(ti, pctx->dev_list[n].dmdev, ti->begin, ti->len, data);
if (ret)
goto out;
}

out:
return ret;
}

static struct target_type switch_target = {
.name = "switch",
.version = {1, 0, 0},
.module = THIS_MODULE,
.ctr = switch_ctr,
.dtr = switch_dtr,
.map = switch_map,
.message = switch_message,
.status = switch_status,
.ioctl = switch_ioctl,
.iterate_devices = switch_iterate_devices,
};

int __init dm_switch_init(void)
{
int r;

r = dm_register_target(&switch_target);
if (r) {
DMERR("dm_register_target() failed %d", r);
return r;
}

return 0;
}

void dm_switch_exit(void)
{
dm_unregister_target(&switch_target);
}

module_init(dm_switch_init);
module_exit(dm_switch_exit);

MODULE_DESCRIPTION(DM_NAME " fixed-size address-region-mapping throughput-oriented path selector");
MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley@dell.com>");
MODULE_AUTHOR("Jim Ramsay <Jim_Ramsay@dell.com>");
MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
MODULE_LICENSE("GPL");
-------------------------

--
Jim Ramsay

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 09-11-2012, 05:13 PM
Mikulas Patocka
 
Default reworked dm-switch target

On Tue, 11 Sep 2012, Jim Ramsay wrote:

>
> On Sat, Sep 08, 2012 at 12:35:22PM -0400, Mikulas Patocka wrote:
> > Your changes are ok, it could be included in the kernel.
> >
> > Regarding flushes ... could you please explain how write-back caching
> > works in your design?
>
> We are not actually using write-back caching in our design at all, but
> assumed that for a general-purpose DM device like dm-switch, some
> underlying storage may rely on REQ_FLUSH to ensure proper data
> synchronization, so the switch should do the right thing when required.
> The code is modeled on dm-stripe.

So you don't need to process REQ_FLUSH at all. REQ_FLUSH doesn't impose
any request synchronization, it only flushes the hardware cache - and
there is no cache. I would remove REQ_FLUSH support if it has no use.

dm-switch is used directly on host adapters (or maybe dm-mpath that is
connected on host adapters), there is no intermediate layer that would
need REQ_FLUSH.

Mikulas

> > Is there some write-back cache in the host adapter?
>
> No, there is not at this time.
>
> > Is there some write-back cache in the individual storage nodes? (i.e. does
> > the node signal end of I/O before it finishes writing data to the disk)
>
> No, there is not at this time. The iSCSI-backed SD devices we are
> ultimately using do not advertise any write-back cache ('sdparm
> --get=WCE' returns 0).
>
> > Another thing - please resend your code with "Signed-off-by". Read the
> > meaning of "Signed-off-by" in Documentation/SubmittingPatches, agree to
> > the terms and append "Signed-off-by: Jim Ramsay <jim_ramsay@dell.com>" to
> > the code. It is a legal requirement, so that you certify that the code is
> > under the open source license and that you have the right to distribute
> > the code.
>
> Will do shortly. Thanks!
>
> --
> Jim Ramsay
>

 
