03-25-2011, 07:25 PM
"Jason Shamberger"

Title: RFC: dm-switch target

We propose a new DM target, dm-switch, which can be used to efficiently
implement a mapping of IOs to underlying block devices in scenarios where there
are: (1) a large number of address regions, (2) a fixed size for these address
regions, and (3) no pattern that allows for a compact description with
something like the dm-stripe target.

Motivation:

Dell EqualLogic and some other iSCSI storage arrays use a distributed frameless
architecture.  In this architecture, the storage group consists of a number of
distinct storage arrays ("members"), each having independent controllers, disk
storage and network adapters.  When a LUN is created it is spread across
multiple members.  The details of the spreading are hidden from initiators
connected to this storage system.  The storage group exposes a single target
discovery portal, no matter how many members are being used.  When iSCSI
sessions are created, each session is connected to an eth port on a single
member.  Data to a LUN can be sent on any iSCSI session, and if the blocks being
accessed are stored on another member the IO will be forwarded as required.
This forwarding is invisible to the initiator.  The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to member as
needed to balance the load.

This architecture simplifies the management and configuration of both the
storage group and initiators.  In a multipathing configuration, it is possible
to set up multiple iSCSI sessions to use multiple network interfaces on both the
host and target to take advantage of the increased network bandwidth.  An
initiator can use a simple round robin algorithm to send IO on all paths and let
the storage array members forward it as necessary.  However, there is a
performance advantage to sending data directly to the correct member.  The
Device Mapper table architecture supports designating different address regions
with different targets.  However, in our architecture the LUN is spread with a
chunk size on the order of tens of megabytes, so the resulting DM table could
have more than a million entries, which consumes too much memory.
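
To make the scale concrete, the entry count can be worked out as in the sketch
below.  The 16 TiB volume and 16 MiB chunk size are illustrative assumptions
rather than EqualLogic parameters, and region_count() is a hypothetical helper
that is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of fixed-size regions needed to cover a
 * volume, rounding up so that a partial trailing region still gets an entry.
 * In the worst case a plain DM table needs one target line per region.
 */
static uint64_t region_count(uint64_t volume_bytes, uint64_t chunk_bytes)
{
	assert(chunk_bytes != 0);
	return (volume_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

Under these assumptions, region_count(16ULL << 40, 16ULL << 20) evaluates to
1048576, i.e. roughly a million table entries for a single volume.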



Solution:

Based on earlier discussion with the dm-devel contributors, we have solved this
problem by using Device Mapper to build a two-layer device hierarchy:

    Upper Tier: Determine which array member the IO should be sent to.
    Lower Tier: Load balance amongst paths to a particular member.

The lower tier consists of a single multipath device for each member.  Each of
these multipath devices contains the set of paths directly to the array member
in one priority group, and leverages existing path selectors to load balance
amongst these paths.  We also build a non-preferred priority group containing
paths to other array members for failover reasons.

The upper tier consists of a single switch device, using the new DM target
module proposed here.  This device uses a bitmap to look up the location of the
IO and choose the appropriate lower tier device to route the IO.  By using a
bitmap we are able to use 4 bits for each address range in a 16 member group
(which is very large for us).  This is a much denser representation than the DM
table B-tree can achieve.
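
As a rough illustration of the packed representation, the user-space sketch
below mirrors the lookup that switch_map() performs in the listing that
follows.  The function names pte_lookup() and packed_table_bytes() are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Look up the member index for one region in a packed table.  Each 32-bit
 * word holds (32 / pte_bits) entries, lowest-order bits first.
 */
static uint32_t pte_lookup(const uint32_t *table, uint64_t region, unsigned pte_bits)
{
	assert(pte_bits >= 1 && pte_bits <= 31);
	uint32_t fields = 32 / pte_bits;		/* entries per word */
	uint32_t mask = (1U << pte_bits) - 1;
	uint64_t word = region / fields;		/* which word holds the entry */
	uint32_t slot = (uint32_t)(region % fields);	/* position inside that word */

	return (table[word] >> (slot * pte_bits)) & mask;
}

/* Bytes of packed table needed for a given number of regions. */
static uint64_t packed_table_bytes(uint64_t regions, unsigned pte_bits)
{
	assert(pte_bits >= 1 && pte_bits <= 31);
	uint64_t fields = 32 / pte_bits;

	return ((regions + fields - 1) / fields) * sizeof(uint32_t);
}
```

With 4-bit entries, packed_table_bytes(1048576, 4) comes to 524288 bytes, so a
million regions fit in 512 KiB.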



Though we have developed this target for a specific storage device, we have made
an effort to keep it as general purpose as possible in hopes that others may
benefit.  We welcome any feedback on the design or implementation.

/*
 ********************************************************************************
 *
 * Copyright (c) 2010 by Dell, Inc.
 *
 * All rights reserved.  This software may not be copied, disclosed,
 * transferred, or used except in accordance with a license granted
 * by Dell, Inc.  This software embodies proprietary information
 * and trade secrets of Dell, Inc.
 *
 * Description:
 *
 *    file:    dm-switch.h
 *    authors: Kevin_OKelley@dell.com and Narendran_Ganapathy@dell.com
 *
 * This file contains the definitions for the "switch" target - particularly
 * the netlink messages.
 *
 ********************************************************************************
 */
/*
 * Copyright (C) 2001-2003 Sistina Software (UK) Limited.
 * Copyright (C) 2004-2008 Red Hat Inc.  All rights reserved.
 *
 * This file is released under the GPL.
 */

#ifndef __DM_SWITCH_H
#define __DM_SWITCH_H

#define MAX_IPC_MSG_LEN		65480	// dictated by netlink socket
#define MAX_ERR_STR_LEN		255	// maximum length of the error string

typedef enum Opcode_Enum
{
	OPCODE_PAGE_TABLE_UPLOAD = 1,
}
Opcode_t;

/*
 * IPC Page Table message
 */
typedef struct IpcPgTable_Struct
{
	uint32_t	total_len;	// Total length of this IPC message
	Opcode_t	opcode;
	uint32_t	userland[2];	// Userland optional data (dmsetup status)
	uint32_t	dev_major;	// DM device major
	uint32_t	dev_minor;	// DM device minor
	uint32_t	page_total;	// Total pages in the volume
	uint32_t	page_offset;	// Starting page offset for this IPC
	uint32_t	page_count;	// Number of page table entries in this IPC
	uint32_t	page_size;	// Page size in 512B sectors
	uint16_t	dev_count;	// Number of devices
	uint8_t		pte_bits;	// Page Table Entry field size in bits
	uint8_t		reserved;	// Integer alignment
	uint8_t		ptbl_buff[1];	// Page table entries (variable length)
}
IpcPgTable_t;

/*
 * IPC Response message
 */
typedef struct IpcResponse_Struct
{
	uint32_t	total_len;	// total length of the IPC
	Opcode_t	opcode;
	uint32_t	userland[2];	// Userland optional data
	uint32_t	dev_major;	// DM device major
	uint32_t	dev_minor;	// DM device minor
	uint32_t	status;
	char		err_str[MAX_ERR_STR_LEN+1];
}
IpcResponse_t;

// Generic Netlink family attributes: used to define the family
enum
{
	NETLINK_ATTR_UNSPEC,
	NETLINK_ATTR_MSG,
	NETLINK_ATTR__MAX,
};
#define NETLINK_ATTR_MAX (NETLINK_ATTR__MAX - 1)

// Netlink commands (operations)
enum
{
	NETLINK_CMD_UNSPEC,
	NETLINK_CMD_GET_PAGE_TBL,
	NETLINK_CMD__MAX,
};
#define NETLINK_CMD_MAX (NETLINK_CMD__MAX - 1)

#endif	/* __DM_SWITCH_H */
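
For readers who want to see how the ptbl_buff payload of an IpcPgTable_t might
be produced, here is a minimal user-space sketch.  It assumes the layout the
kernel side reads back (entries packed into native-endian 32-bit words, lowest
bits first); pte_payload_bytes() and pte_pack() are hypothetical names, not
part of the posted code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bytes of ptbl_buff payload needed for page_count entries of pte_bits each. */
static uint32_t pte_payload_bytes(uint32_t page_count, unsigned pte_bits)
{
	assert(pte_bits >= 1 && pte_bits <= 31);
	uint32_t fields = 32 / pte_bits;	/* entries per 32-bit word */

	return ((page_count + fields - 1) / fields) * (uint32_t)sizeof(uint32_t);
}

/*
 * Pack one device index per page into 32-bit words, lowest-order bits first.
 * 'out' must provide at least pte_payload_bytes(page_count, pte_bits) bytes.
 */
static void pte_pack(uint32_t *out, const uint8_t *dev_index,
		     uint32_t page_count, unsigned pte_bits)
{
	assert(pte_bits >= 1 && pte_bits <= 31);
	uint32_t fields = 32 / pte_bits;
	uint32_t mask = (1U << pte_bits) - 1;
	uint32_t i;

	memset(out, 0, pte_payload_bytes(page_count, pte_bits));
	for (i = 0; i < page_count; i++)
		out[i / fields] |= ((uint32_t)dev_index[i] & mask)
				   << ((i % fields) * pte_bits);
}
```

Because whole words are copied on the kernel side, a sender following this
sketch would presumably want each message's page_offset to be a multiple of
the number of entries per word.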



/*
 ********************************************************************************
 *
 * Copyright (c) 2010-2011 by Dell, Inc.
 *
 * All rights reserved.  This software may not be copied, disclosed,
 * transferred, or used except in accordance with a license granted
 * by Dell, Inc.  This software embodies proprietary information
 * and trade secrets of Dell, Inc.
 *
 * Description:
 *
 *    file:    dm-switch.c
 *    authors: Kevin_OKelley@dell.com and Narendran_Ganapathy@dell.com
 *
 * This file contains all the functions to create a "switch" target to
 * separate the MPIO to the preferred block mode devices.
 *
 ********************************************************************************
 */
/*
 * Copyright (C) 2001-2003 Sistina Software (UK) Limited.
 * Copyright (C) 2004-2008 Red Hat Inc.  All rights reserved.
 *
 * This file is released under the GPL.
 */

#include <linux/module.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/string.h>
#include <linux/device.h>
#include <linux/version.h>
#include <linux/dm-ioctl.h>
#include <linux/device-mapper.h>
#include <net/genetlink.h>
#include <asm/div64.h>

#include "dm-switch.h"

#define DM_MSG_PREFIX "switch"
MODULE_DESCRIPTION(DM_NAME " throughput-oriented path selector");
MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley@dell.com>");
MODULE_LICENSE("GPL");

/*
 * Switch context block: A new one is created for each dm device.  Contains an array of devices
 * for which we have taken references.
 */
struct switch_dev {
	struct dm_dev *dmdev;
	sector_t	start;
	atomic_t	error_count;
};

struct switch_ptbl {
	uint32_t	pte_bits;	// Page Table Entry field size in bits
	uint32_t	pte_mask;	// Page Table Entry field mask
	uint32_t	pte_fields;	// Number of Page Table Entries per uint32_t
	uint32_t	ptbl_bytes;	// Page table size in bytes
	uint32_t	ptbl_num;	// Page table size in entries
	uint32_t	ptbl_max;	// Page table maximum size in entries
	uint32_t	ptbl_buff[0];	// Address of page table
};

struct switch_ctx {
	struct list_head list;
	dev_t		dev_this;	// Device serviced by this target
	uint32_t	dev_count;	// Number of devices
	uint32_t	page_size;	// Page size in 512B sectors
	uint32_t	userland[2];	// Userland optional data (dmsetup status)
	uint64_t	ios_remapped, ios_unmapped;	// I/Os remapped, I/Os not remapped
	spinlock_t	spinlock;	// Control access to counters

	struct switch_ptbl *ptbl;	// Page table (if loaded)
	struct switch_dev dev_list[0];	// Array of dm devices to switch between
};

/*
 * Global variables
 */
LIST_HEAD(__g_context_list);		// Linked list of context blocks
static spinlock_t __g_spinlock;		// Control access to list of context blocks

static int switch_ctr_limits(struct dm_target *ti, struct dm_dev *dm)
{
	struct block_device *sd = dm->bdev;
	struct hd_struct *hd = sd->bd_part;

	if (hd != NULL) {
		if (ti->len <= hd->nr_sects)
			return true;
		ti->error = "Device too small for target";
		return false;
	}

	ti->error = "Missing device limits";
	printk("%s %s\n", __FUNCTION__, ti->error);
	return true;
}

/*
 * Constructor: Called each time a dmsetup command creates a dm device.  The target parameter will already
 * have the table, type, begin and len fields filled in.  Arguments are <dev_count> <page_size> followed by
 * one <dev_path> <offset> pair per device.  We get multiple constructor calls, so we build a list of
 * switch_ctx blocks so that the page table information gets matched to the correct device.
 */
static int switch_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
	int n;
	unsigned int dev_count;
	unsigned long flags, major, minor;
	unsigned long long start;
	struct switch_ctx *pctx;
	struct mapped_device *md = NULL;
	struct dm_dev *dm;
	char *chp;

	if (argc < 4) {
		ti->error = "Insufficient arguments";
		return -EINVAL;
	}
	dev_count = simple_strtoul(argv[0], &chp, 10);
	if (*chp) {
		ti->error = "Invalid device count";
		return -EINVAL;
	}
	/* Device arguments come in pairs, so the total argument count must be even. */
	if ((argc & 1) || (dev_count != (argc - 2) / 2)) {
		ti->error = "Invalid argument count";
		return -EINVAL;
	}
	pctx = kmalloc(sizeof(*pctx) + (dev_count * sizeof(struct switch_dev)), GFP_KERNEL);
	if (pctx == NULL) {
		ti->error = "Cannot allocate redirect context";
		return -ENOMEM;
	}
	pctx->dev_count = dev_count;
	pctx->page_size = simple_strtoul(argv[1], &chp, 10);
	if ((*chp) || (pctx->page_size == 0)) {
		ti->error = "Invalid page size";
		goto failed_kfree;
	}
	pctx->ptbl = NULL;
	pctx->userland[0] = pctx->userland[1] = 0;
	pctx->ios_remapped = pctx->ios_unmapped = 0;
	spin_lock_init(&pctx->spinlock);

	/*
	 * Find the device major and minor for the device that is being served by this target.
	 */
	md = dm_table_get_md(ti->table);
	if (md == NULL) {
		ti->error = "Cannot locate dm device";
		goto failed_kfree;
	}
	chp = (char *) dm_device_name(md);
	if (chp == NULL) {
		ti->error = "Cannot acquire dm device name";
		goto failed_kfree;
	}
	major = simple_strtoul(chp, &chp, 10);
	if (*chp++ != ':') {
		ti->error = "Invalid dm device name (major)";
		goto failed_kfree;
	}
	minor = simple_strtoul(chp, &chp, 10);
	if (*chp) {
		ti->error = "Invalid dm device name (minor)";
		goto failed_kfree;
	}
	pctx->dev_this = MKDEV(major, minor);

	/*
	 * Check each device beneath the target to ensure that the limits are consistent.
	 */
	for (n = 0, argc = 2; n < pctx->dev_count; n++, argc += 2) {
		if (sscanf(argv[argc + 1], "%llu", &start) != 1) {
			ti->error = "Invalid device starting offset";
			goto failed_dev_list_prev;
		}
		if (dm_get_device(ti, argv[argc], dm_table_get_mode(ti->table), &dm)) {
			ti->error = "Device lookup failed";
			goto failed_dev_list_prev;
		}
		pctx->dev_list[n].dmdev = dm;
		pctx->dev_list[n].start = start;
		atomic_set(&(pctx->dev_list[n].error_count), 0);
		if (!switch_ctr_limits(ti, dm))
			goto failed_dev_list_all;
	}

	spin_lock_irqsave(&__g_spinlock, flags);
	list_add_tail(&pctx->list, &__g_context_list);
	spin_unlock_irqrestore(&__g_spinlock, flags);
	ti->private = pctx;
	return 0;

failed_dev_list_prev:			// De-reference previous devices
	n--;				//   (i.e. don't include this one)
failed_dev_list_all:			// De-reference all devices
	printk("%s device=%s, start=%s\n", __FUNCTION__, argv[argc], argv[argc + 1]);
	for (; n >= 0; n--) {
		dm_put_device(ti, pctx->dev_list[n].dmdev);
	}

failed_kfree:
	printk(KERN_WARNING "%s %s\n", __FUNCTION__, ti->error);
	kfree(pctx);
	return -EINVAL;
}

/*
 * Destructor: Don't free the dm_target, just the ti->private data (if any).
 */
static void switch_dtr(struct dm_target *ti)
{
	int n;
	unsigned long flags;
	struct switch_ctx *pctx = (struct switch_ctx *) ti->private;
	void *ptbl;

	spin_lock_irqsave(&__g_spinlock, flags);
	ptbl = pctx->ptbl;
	rcu_assign_pointer(pctx->ptbl, NULL);
	list_del(&pctx->list);
	spin_unlock_irqrestore(&__g_spinlock, flags);
	for (n = 0; n < pctx->dev_count; n++) {
		dm_put_device(ti, pctx->dev_list[n].dmdev);
	}
	synchronize_rcu();
	kfree(ptbl);		// kfree(NULL) is a no-op
	kfree(pctx);
}

/*
 * NOTE: If CONFIG_LBD is disabled, sector_t types are uint32_t.  Therefore, in this routine, we
 * convert the offset into a uint64_t instead of a sector_t so that all of the remaining arithmetic
 * is correct, including the do_div() calls.
 */
static int switch_map(struct dm_target *ti, struct bio *bio,
		union map_info *map_context)
{
	struct switch_ctx *pctx = (struct switch_ctx *) ti->private;
	struct switch_ptbl *ptbl;
	unsigned long flags;
	uint64_t itbl, offset = bio->bi_sector - ti->begin;
	uint32_t idev = 0, irem;
	uint64_t *pinc = &pctx->ios_unmapped;

	rcu_read_lock();
	ptbl = rcu_dereference(pctx->ptbl);
	if (ptbl != NULL)
	{
		itbl = offset;
		do_div(itbl, pctx->page_size);
		if (itbl < ptbl->ptbl_num) {
			irem = do_div(itbl, ptbl->pte_fields);
			idev = (ptbl->ptbl_buff[itbl] >> (irem * ptbl->pte_bits))
				& ptbl->pte_mask;
			/* dev_list[] holds dev_count entries, so valid indexes are 0..dev_count-1. */
			if (idev < pctx->dev_count) {
				pinc = &pctx->ios_remapped;
			}
			else {
				printk(KERN_WARNING "%s dev=%u, offset=%llu\n", __FUNCTION__,
					idev, (unsigned long long) offset);
				idev = 0;
			}
		}
		else {
			printk(KERN_WARNING "%s Page Table Entry %llu >= %u\n", __FUNCTION__,
					(unsigned long long) itbl, ptbl->ptbl_num);
		}
	}
	rcu_read_unlock();
	spin_lock_irqsave(&pctx->spinlock, flags);
	(*pinc)++;
	spin_unlock_irqrestore(&pctx->spinlock, flags);
	bio->bi_bdev = pctx->dev_list[idev].dmdev->bdev;
	bio->bi_sector = pctx->dev_list[idev].start + offset;
	return DM_MAPIO_REMAPPED;
}

/*
 * Switch status:
 *
 * INFO: #dev_count device [device] 5 'A'['A' ...] userland[0] userland[1] #remapped #unmapped
 * where:
 *   "'A'['A']" is a single word with an 'A' (active) or 'D' for each device
 *   The userland values are set by the last userland message to load the page table
 *   "#remapped" is the number of remapped I/Os
 *   "#unmapped" is the number of I/Os that could not be remapped
 *
 * TABLE: #dev_count #page_size device start [device start ...]
 */
static int switch_status(struct dm_target *ti, status_type_t type, char *result,
		unsigned int maxlen)
{
	struct switch_ctx *pctx = (struct switch_ctx *) ti->private;
	char buffer[pctx->dev_count + 1];
	unsigned int sz = 0;
	int n;
	uint64_t remapped, unmapped;
	unsigned long flags;

	result[0] = '\0';
	switch (type) {
		case STATUSTYPE_INFO:
			DMEMIT("%u", pctx->dev_count);
			for (n = 0; n < pctx->dev_count; n++) {
				DMEMIT(" %s", pctx->dev_list[n].dmdev->name);
				buffer[n] = 'A';
			}
			buffer[n] = '\0';
			spin_lock_irqsave(&pctx->spinlock, flags);
			remapped = pctx->ios_remapped;
			unmapped = pctx->ios_unmapped;
			spin_unlock_irqrestore(&pctx->spinlock, flags);
			DMEMIT(" 5 %s %08x %08x %llu %llu", buffer, pctx->userland[0], pctx->userland[1],
					(unsigned long long) remapped, (unsigned long long) unmapped);
			break;

		case STATUSTYPE_TABLE:
			DMEMIT("%u %u", pctx->dev_count, pctx->page_size);
			for (n = 0; n < pctx->dev_count; n++) {
				DMEMIT(" %s %llu", pctx->dev_list[n].dmdev->name,
						(unsigned long long) pctx->dev_list[n].start);
			}
			break;

		default:
			return 0;
	}
	return 0;
}

/*
 * Switch ioctl:
 *
 * Passthrough all ioctls to the first path.
 */
static int switch_ioctl(struct dm_target *ti, unsigned int cmd,
		unsigned long arg)
{
	struct switch_ctx *pctx = (struct switch_ctx *) ti->private;
	struct block_device *bdev;
	fmode_t mode = 0;

	/* Sanity check */
	if (unlikely(!pctx || !pctx->dev_list[0].dmdev ||
			!pctx->dev_list[0].dmdev->bdev))
		return -EIO;

	bdev = pctx->dev_list[0].dmdev->bdev;
	mode = pctx->dev_list[0].dmdev->mode;
	return __blkdev_driver_ioctl(bdev, mode, cmd, arg);
}

static struct target_type __g_switch_target = {
	.name	= "switch",
	.version = {1, 0, 0},
	.module	= THIS_MODULE,
	.ctr	= switch_ctr,
	.dtr	= switch_dtr,
	.map	= switch_map,
	.status	= switch_status,
	.ioctl	= switch_ioctl,
};

// Generic Netlink attribute policy (single attribute, NETLINK_ATTR_MSG)
static struct nla_policy __g_attr_policy[NETLINK_ATTR_MAX + 1] =
{
	[NETLINK_ATTR_MSG] = { .type = NLA_BINARY, .len = MAX_IPC_MSG_LEN },
};

// Define the Generic Netlink family
static struct genl_family __g_family =
{
	.id		= GENL_ID_GENERATE,	// Assign channel when family is registered
	.hdrsize	= 0,
	.name		= "DM_SWITCH",
	.version	= 1,
	.maxattr	= NETLINK_ATTR_MAX,
};

/*
 * Generic Netlink socket read function that handles communication from the userland
 * for downloading the page table.
 */
static int get_page_tbl(struct sk_buff *skb_2, struct genl_info *info)
{
	uint32_t		rc, pte_mask, pte_fields, ptbl_bytes, offset, size;
	uint32_t		status = 0;
	unsigned long		flags;
	int			msg_len;
	char			*mydata;
	void			*msg_head;
	struct nlattr		*na;
	struct sk_buff		*skb;
	struct switch_ctx	*pctx;
	struct switch_ptbl	*ptbl, *pnew, *pold = NULL;
	IpcPgTable_t		*pgp;
	IpcResponse_t		resp;
	dev_t			dev;
	static const char	*invmsg = "Invalid Page Table message";

	/*
	 * For each attribute there is an index in info->attrs which points to a nlattr
	 * structure; the data is carried in that structure.
	 */
	if (info == NULL) {
		printk(KERN_ERR "%s missing genl_info parameter\n", __FUNCTION__);
		return 0;
	}
	na = info->attrs[NETLINK_ATTR_MSG];
	if (na == NULL) {
		printk(KERN_ERR "%s no info->attrs %i\n", __FUNCTION__, NETLINK_ATTR_MSG);
		return 0;
	}
	mydata = (char *) nla_data(na);
	if (mydata == NULL) {
		printk(KERN_ERR "%s error while receiving data\n", __FUNCTION__);
		return 0;
	}
	msg_len = nla_len(na);		// payload length, excluding the attribute header

	/*
	 * Format the reply message.  Return positive error codes to userland.
	 * Clear the response first so no stale stack data is sent back.
	 */
	memset(&resp, 0, sizeof(resp));
	skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
	if (skb == NULL) {
		printk(KERN_ERR "%s cannot allocate reply message\n", __FUNCTION__);
		return 0;
	}
	msg_head = genlmsg_put(skb, 0, info->snd_seq, &__g_family, 0, NETLINK_CMD_GET_PAGE_TBL);
	if (msg_head == NULL) {
		printk(KERN_ERR "%s cannot format reply message header\n", __FUNCTION__);
		nlmsg_free(skb);
		return 0;
	}
	pgp = (IpcPgTable_t *) mydata;
	if ((size_t) msg_len < sizeof(IpcPgTable_t)) {
		snprintf(resp.err_str, sizeof(resp.err_str), "%s: too short (%d)", invmsg, msg_len);
		status = EINVAL;
		pgp = NULL;		// header is incomplete, so do not echo its fields
		goto failed_respond;
	}
	if ((pgp->page_offset + pgp->page_count) > pgp->page_total) {
		snprintf(resp.err_str, sizeof(resp.err_str), "%s: too many page table entries (%u > %u)",
				invmsg, (pgp->page_offset + pgp->page_count), pgp->page_total);
		status = EINVAL;
		goto failed_respond;
	}
	// A zero width would divide by zero below; 32 or more would overflow the shift.
	if ((pgp->pte_bits == 0) || (pgp->pte_bits >= 32)) {
		snprintf(resp.err_str, sizeof(resp.err_str), "%s: invalid entry size of %u bits",
				invmsg, pgp->pte_bits);
		status = EINVAL;
		goto failed_respond;
	}
	pte_mask = (1U << pgp->pte_bits) - 1;
	if (((pgp->dev_count - 1) & (~pte_mask)) != 0) {
		snprintf(resp.err_str, sizeof(resp.err_str), "%s: invalid mask 0x%x for %u devices",
				invmsg, pte_mask, pgp->dev_count);
		status = EINVAL;
		goto failed_respond;
	}
	pte_fields = 32 / pgp->pte_bits;
	size = ((pgp->page_count + pte_fields - 1) / pte_fields) * sizeof(uint32_t);
	if ((sizeof(*pgp) - 1 + size) > (size_t) msg_len) {
		snprintf(resp.err_str, sizeof(resp.err_str), "%s: incomplete message", invmsg);
		status = EINVAL;
		goto failed_respond;
	}

	// Look for the corresponding switch context block to create or update the page table.
	rc = 0;
	dev = MKDEV(pgp->dev_major, pgp->dev_minor);
	spin_lock_irqsave(&__g_spinlock, flags);
	list_for_each_entry(pctx, &__g_context_list, list) {
		if (dev == pctx->dev_this) {
			rc = 1;
			break;
		}
	}
	if (rc == 0) {
		snprintf(resp.err_str, sizeof(resp.err_str), "%s: invalid target device %u:%u",
				invmsg, pgp->dev_major, pgp->dev_minor);
		status = EINVAL;
		goto failed_unlock;
	}

	ptbl = pctx->ptbl;
	if ( ( (ptbl != NULL) && (pgp->page_offset > (ptbl->ptbl_num + 1)) ) ||
	     ( (ptbl == NULL) && (pgp->page_offset != 0) ) ) {
		snprintf(resp.err_str, sizeof(resp.err_str), "%s: missing entries", invmsg);
		status = EINVAL;
		goto failed_unlock;
	}
	// Don't allow userland to change context parameters unless the page table is being rebuilt.
	if (pgp->page_offset != 0) {
		if ((pgp->dev_count) != pctx->dev_count) {
			snprintf(resp.err_str, sizeof(resp.err_str), "%s: invalid device count %u",
					invmsg, pgp->dev_count);
			status = EINVAL;
			goto failed_unlock;	// the list spinlock is held here
		}
		if (ptbl != NULL) {
			if (pgp->pte_bits != ptbl->pte_bits) {
				snprintf(resp.err_str, sizeof(resp.err_str), "%s: number of bits changed", invmsg);
				status = EINVAL;
				goto failed_unlock;
			}
			if (pgp->page_total != ptbl->ptbl_max) {
				snprintf(resp.err_str, sizeof(resp.err_str), "%s: total number of entries changed", invmsg);
				status = EINVAL;
				goto failed_unlock;
			}
		}
	}

	// Create a Page Table if needed.  Most of the time, the size of the table
	// doesn't change.  In that case, re-use the existing table.
	ptbl_bytes = ((pgp->page_total + pte_fields - 1) / pte_fields) * sizeof(uint32_t);
	if ((ptbl != NULL) && (ptbl_bytes == ptbl->ptbl_bytes)) {
		pnew = ptbl;
	}
	else {
		// We hold a spinlock with interrupts off, so the allocation must not sleep.
		pnew = kmalloc((sizeof(*pnew) + ptbl_bytes), GFP_ATOMIC);
		if (pnew == NULL) {
			snprintf(resp.err_str, sizeof(resp.err_str), "Cannot allocate Page Table");
			status = ENOMEM;
			goto failed_unlock;
		}
		pnew->ptbl_bytes = ptbl_bytes;
	}
	pnew->pte_bits = pgp->pte_bits;
	pnew->pte_mask = pte_mask;
	pnew->pte_fields = pte_fields;
	pnew->ptbl_max = pgp->page_total;
	pnew->ptbl_num = pgp->page_offset + pgp->page_count;
	// Whole words are copied, so page_offset is expected to be a multiple of pte_fields.
	offset = (pgp->page_offset + pte_fields - 1) / pte_fields;
	memcpy(&pnew->ptbl_buff[offset], pgp->ptbl_buff, size);
	pctx->userland[0] = pgp->userland[0];
	pctx->userland[1] = pgp->userland[1];

	if (pnew != ptbl) {
		rcu_assign_pointer(pctx->ptbl, pnew);
		pold = ptbl;		// freed below, once readers are done with it
	}

failed_unlock:
	spin_unlock_irqrestore(&__g_spinlock, flags);
	if (pold != NULL) {
		// switch_map() may still be reading the old table under rcu_read_lock().
		synchronize_rcu();
		kfree(pold);
	}

failed_respond:
	if (status != 0)
		printk(KERN_WARNING "%s WARNING: %s\n", __FUNCTION__, resp.err_str);

	// Format the response message
	resp.total_len = sizeof(IpcResponse_t);
	resp.opcode = OPCODE_PAGE_TABLE_UPLOAD;
	if (pgp != NULL) {
		resp.userland[0] = pgp->userland[0];
		resp.userland[1] = pgp->userland[1];
		resp.dev_major = pgp->dev_major;
		resp.dev_minor = pgp->dev_minor;
	}
	resp.status = status;
	// The second argument is the attribute index (not the NLA_BINARY policy type).
	rc = nla_put(skb, NETLINK_ATTR_MSG, sizeof(IpcResponse_t), &resp);
	if (rc != 0) {
		printk(KERN_WARNING "%s WARNING: Cannot format reply message\n", __FUNCTION__);
		nlmsg_free(skb);
		return 0;
	}
	genlmsg_end(skb, msg_head);
	rc = genlmsg_unicast(&init_net, skb, info->snd_pid);
	if (rc != 0) {
		printk(KERN_WARNING "%s WARNING: Cannot send reply message\n", __FUNCTION__);
		return 0;
	}
	return 0;
}

// Operation for getting the page table
static struct genl_ops __g_op_get_page_tbl =
{
	.cmd	= NETLINK_CMD_GET_PAGE_TBL,
	.flags	= 0,
	.policy	= __g_attr_policy,
	.doit	= get_page_tbl,
	.dumpit	= NULL,
};

/*
 * Use the sysfs interface to inform the userland process of the family id to be used
 * by the Generic Netlink socket.
 */
static ssize_t sysfs_familyid_show(struct kobject *kobj, struct attribute *attr, char *buff)
{
	return snprintf(buff, PAGE_SIZE, "%d", __g_family.id);
}

static ssize_t sysfs_familyid_store(struct kobject *kobj, struct attribute *attr,
		const char *buff, size_t size)
{
	return size;
}

static struct {
	struct attribute attr;
	struct sysfs_ops ops;
}
__g_sysfs_familyid = {
	{ "familyid", 0644 },
	{ &sysfs_familyid_show, &sysfs_familyid_store },
};

static int __init dm_switch_init(void)
{
	int r;

	spin_lock_init(&__g_spinlock);
	r = dm_register_target(&__g_switch_target);
	if (r) {
		DMERR("dm_register_target() failed %d", r);
		return r;
	}

	// Initialize Generic Netlink communications
	r = genl_register_family(&__g_family);
	if (r) {
		DMERR("genl_register_family() failed %d", r);
		goto failed_target;
	}
	r = genl_register_ops(&__g_family, &__g_op_get_page_tbl);
	if (r) {
		DMERR("genl_register_ops(get_page_tbl) failed %d", r);
		goto failed_family;
	}
	r = sysfs_create_file(&__g_switch_target.module->mkobj.kobj, &__g_sysfs_familyid.attr);
	if (r) {
		DMERR("/sys/module/familyid create failed %d", r);
		goto failed_family;
	}
	return 0;

failed_family:		// Undo the netlink registration before dropping the target
	genl_unregister_family(&__g_family);
failed_target:
	dm_unregister_target(&__g_switch_target);
	return r;
}

static void __exit dm_switch_exit(void)
{
	int r;

	sysfs_remove_file(&__g_switch_target.module->mkobj.kobj, &__g_sysfs_familyid.attr);
	dm_unregister_target(&__g_switch_target);
	r = genl_unregister_family(&__g_family);
	if (r)
		DMWARN("genl_unregister_family() failed %d", r);
}

module_init(dm_switch_init);
module_exit(dm_switch_exit);
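
Since switch_ctr() parses <dev_count> <page_size> followed by <dev_path>
<offset> pairs, a matching table line could be assembled roughly as sketched
below.  This is only an illustration: switch_table_line() and struct
switch_path are hypothetical helpers, and the device paths and sizes used with
them are made up.  The resulting string is the kind of line one would hand to
dmsetup when creating the upper-tier device.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct switch_path {
	const char *dev;		/* lower-tier device path (hypothetical) */
	unsigned long long start;	/* starting offset on that device, in sectors */
};

/*
 * Format "0 <sectors> switch <dev_count> <page_size> <dev> <start> ..." into buf.
 * Returns the string length, or -1 if buf is too small.
 */
static int switch_table_line(char *buf, size_t len, unsigned long long sectors,
			     unsigned page_size, const struct switch_path *paths,
			     unsigned dev_count)
{
	int n, used;
	unsigned i;

	assert(buf != NULL && len > 0 && paths != NULL && dev_count > 0);
	used = snprintf(buf, len, "0 %llu switch %u %u", sectors, dev_count, page_size);
	if (used < 0 || (size_t)used >= len)
		return -1;
	for (i = 0; i < dev_count; i++) {
		assert(strlen(paths[i].dev) > 0);
		n = snprintf(buf + used, len - used, " %s %llu",
			     paths[i].dev, paths[i].start);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += n;
	}
	return used;
}
```

For two hypothetical lower-tier devices and a 30720-sector (15 MB) page size on
a 2097152-sector volume, this yields
"0 2097152 switch 2 30720 /dev/mapper/member0 0 /dev/mapper/member1 0".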






--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
