11-04-2010, 05:29 PM
Alex Bligh

How to generate a large file allocating space

Ted,

--On 4 November 2010 12:16:13 -0400 Ted Ts'o <tytso@mit.edu> wrote:


> Well, I would personally not be against an extension to fallocate()
> where if the caller of the syscall specifies a new flag, that might be
> named FALLOC_FL_EXPOSE_OLD_DATA, and if the caller either has root
> privs or (if capabilities are enabled) CAP_DAC_OVERRIDE &&
> CAP_MAC_OVERRIDE, it would be able to allocate blocks whose extents
> would be marked as initialized without actually initializing the
> blocks first.


That sounds a lot like "send patches" which I just might do, if only
to gain better understanding as to what is going on.

I seem to remember (from lwn's summary of lkml) that the proposed
options for fallocate() got a bit baroque to start with, and people
then simplified down to zero options. Perhaps that was a simplification
too far.

In the meantime, particularly as I'd ideally like to avoid a kernel
modification, is there a safe way I could use or modify the ext2
library to run through the extents of a fallocated() file and clear
the "unwritten" bit? If I clear that (which from memory is the top
bit of the extent length), is that alone safe? (on an unmounted
file system, obviously).


> You do realize, though, that it sounds like with your design you are
> replicating the servers, but not the disk devices --- so if your disk
> device explodes, you're Sadly Out of Luck. Sure you can use
> super-expensive storage arrays, but if you're writing your own cluster
> file system, why not create a design which uses commodity disks and
> worry about replicating data across servers at the cluster file system
> level?


The particular use case here is for customers that have sunk huge
amounts of money into expensive storage arrays, or for whatever
reason have an aversion to storing anything on anything other than
expensive storage arrays.

I would tend to agree that replicating across commodity disks is
in almost all cases a better technological solution, but the
technology is still further away from readiness there. Sadly
technological arguments don't always win the day, and we need
something in the meantime...

--
Alex Bligh

_______________________________________________
Ext3-users mailing list
Ext3-users@redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users
 
11-04-2010, 06:17 PM
Ted Ts'o

How to generate a large file allocating space

On Thu, Nov 04, 2010 at 06:29:47PM +0000, Alex Bligh wrote:
>
> >Well, I would personally not be against an extension to fallocate()
> >where if the caller of the syscall specifies a new flag, that might be
> >named FALLOC_FL_EXPOSE_OLD_DATA, and if the caller either has root
> >privs or (if capabilities are enabled) CAP_DAC_OVERRIDE &&
> >CAP_MAC_OVERRIDE, it would be able to allocate blocks whose extents
> >would be marked as initialized without actually initializing the
> >blocks first.
>
> That sounds a lot like "send patches" which I just might do, if only
> to gain better understanding as to what is going on.

Patches to do this wouldn't be that hard. The harder part would
probably be the politics on fs-devel regarding the semantics of
FALLOC_FL_EXPOSE_OLD_DATA.
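
To make the semantic question concrete: today a preallocated range always reads back as zeros, never as whatever the blocks held before, and that guarantee is what a flag like FALLOC_FL_EXPOSE_OLD_DATA would give up. A minimal Python sketch of the current behaviour, assuming a Unix system where os.posix_fallocate() is available. The helper names preallocate() and demo() are made up for illustration, and whether the request is satisfied with unwritten extents or by writing zeros is up to the filesystem and libc:

```python
import os
import tempfile


def preallocate(path, size):
    """Reserve `size` bytes for `path` without writing any data ourselves."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Portable wrapper; it has no flags argument, so it cannot express
        # anything like the proposed "expose old data" behaviour.
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


def demo(size=1 << 20):
    """Preallocate a temp file; return (file size, True if it reads as zeros)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "prealloc.img")
        preallocate(path, size)
        with open(path, "rb") as f:
            data = f.read()
        return os.path.getsize(path), data == bytes(size)


if __name__ == "__main__":
    print(demo())
```

The cost being discussed in this thread is exactly that zero guarantee: it is either paid up front (zeros written) or deferred through the unwritten marker on each extent.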

> I seem to remember (from lwn's summary of lkml) that the proposed
> options for fallocate() got a bit baroque to start with, and people
> then simplified down to zero options. Perhaps that was a simplification
> too far.

It was simplified down to one flag. But that means we have a flags
field we can use to extend fallocate.

> In the meantime, particularly as I'd ideally like to avoid a kernel
> modification, is there a safe way I could use or modify the ext2
> library to run through the extents of a fallocated() file and clear
> the "unwritten" bit? If I clear that (which from memory is the top
> bit of the extent length), is that alone safe? (on an unmounted
> file system, obviously).

Yes, there most certainly is. The functions you'd probably want to
use are ext2fs_extent_open(), and then either ext2fs_extent_goto()
to go to a specific extent, or ext2fs_extent_get() with the
EXT2_EXTENT_NEXT operation to iterate over the extents, and then
ext2fs_extent_replace() to mutate the extent. Oh, and then use
ext2fs_extent_close() when you're done looking at and/or changing the
extents of a file.
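
For a feel of what mutating an extent amounts to on disk, here is a rough Python model of a depth-0 extent node, assuming the usual ext4 layout: a 12-byte header followed by 12-byte extents, where a 16-bit length above 32768 marks an unwritten extent of (length - 32768) blocks. It is only a sketch of the encoding, not the libext2fs API. decode_len() and mark_leaf_written() are hypothetical names, and the code ignores checksums and everything else a real tool has to care about, which is the reason to go through ext2fs_extent_replace() rather than patching bytes.

```python
import struct

EXT4_EXT_MAGIC = 0xF30A
EXT_INIT_MAX_LEN = 1 << 15          # lengths above this mark an unwritten extent
HEADER = struct.Struct("<HHHHI")    # magic, entries, max, depth, generation
EXTENT = struct.Struct("<IHHI")     # ee_block, ee_len, ee_start_hi, ee_start_lo


def decode_len(ee_len):
    """Return (block_count, is_unwritten) for a raw 16-bit ee_len."""
    if ee_len > EXT_INIT_MAX_LEN:
        return ee_len - EXT_INIT_MAX_LEN, True
    return ee_len, False


def mark_leaf_written(node):
    """Return a copy of a depth-0 extent node with every extent marked written."""
    magic, entries, _max, depth, _gen = HEADER.unpack_from(node, 0)
    if magic != EXT4_EXT_MAGIC or depth != 0:
        raise ValueError("not an extent leaf node")
    out = bytearray(node)
    for i in range(entries):
        off = HEADER.size + i * EXTENT.size
        block, ee_len, hi, lo = EXTENT.unpack_from(out, off)
        count, _unwritten = decode_len(ee_len)
        # Storing the plain block count drops the unwritten marker.
        EXTENT.pack_into(out, off, block, count, hi, lo)
    return bytes(out)
```

Note that a length of exactly 32768 is a written extent, so "top bit set" is not quite the whole test. In libext2fs terms the same step would, as far as the public headers go, be clearing EXT2_EXTENT_FLAGS_UNINIT in the e_flags field of the struct ext2fs_extent before handing it to ext2fs_extent_replace(). The headers also appear to name the call that releases the handle ext2fs_extent_free(), so check ext2fs.h for the exact name.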

If you build tst_extents in lib/ext2fs, you can use commands like
"inode" (to open the extents for a particular inode), and "root",
"current", "next", "prev", "next_leaf", "prev_leaf", "next_sibling",
"prev_sibling", "delete_node", "insert_node", "replace_node",
"split_node", "print_all", "goto", etc. Please don't use this in
production, but it's not a bad way to play with an extent tree, either
for learning purposes or to create test cases. tst_extents.c is also
a good way of seeing how the various libext2fs extent APIs work.

> I would tend to agree that replicating across commodity disks is
> in almost all cases a better technological solution, but the
> technology is still further away from readiness there. Sadly
> technological arguments don't always win the day, and we need
> something in the meantime...

Well, things like Hadoopfs exist today, and Ceph (if you need a
POSIX-level access) is admittedly less stable. But if you're starting
from scratch, wouldn't that be pretty far away from readiness as well?

- Ted

 
11-04-2010, 10:05 PM
Bodo Thiesen

How to generate a large file allocating space

Hello Alex

* Alex Bligh <alex@alex.org.uk> wrote:
>* --On 4 November 2010 13:46:38 +0100 Bodo Thiesen <bothie@gmx.de> wrote:
>> Question: Did you consider using plain LVM for this purpose?
>> By creating a
>> logical volume, no data is initialized, only the meta data is created
>> (what seems to be exactly what you need). Then, each client may access one
>> logical volume r/w. Retrieving the extents list is very easy as well. And
>> because there are no group management data (cluster bitmaps, inode bitmaps
>> and tables) of any kind, you will end up with only one single extent in
>> most cases regardless of the size of the volume you've created.
> Plain LVM or Clustered LVM? Clustered LVM has some severe limitations,
> including needing to restart the entire cluster to add nodes, which
> is not acceptable.
>
> Plain LVM has two types of issue:
>
> 1. Without clustered LVM, as far as I can tell there is no locking
> of metadata.

Possibly (I don't know exactly).

> I have no guarantees that access to the disk does not
> go outside the LV's allocation.

In LVM you create one logical volume. In the process of creating that
volume, metadata gets updated. But just using the pre-existing logical
volumes doesn't change the metadata. So, if you do all creation and
removal of logical volumes on the same node, then you shouldn't get any
problems here. "lvchange -a[yn] $lv" doesn't even change the metadata,
it's a completely local operation (the local lvm cache gets updated, but
that's all). So, if you provide access via nbd or something like that to
the pv, all nodes could just use their portion of the lv without any
problems. Besides: You wanted to use ext4. I suggested using lvm in the
same way you initially wanted to use ext4. So: On the main node you use
the command "lvdisplay -v $lv" (or whatever the exact command line is) and
you get a list of extents as a result. Then you transfer that list to the
client and it can access the disk directly without issuing any lvm command
at all.
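
The last step, a client using the transferred list to address the disk directly, could look roughly like this in Python. The Segment tuple, the byte units and to_physical() are assumptions made for illustration; how the list is extracted from the lvdisplay output is not shown.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    logical_start: int   # first byte of this segment within the LV
    physical_start: int  # matching byte offset on the shared PV
    length: int          # segment length in bytes


def to_physical(segments, offset, length):
    """Translate an LV byte range into a list of (pv_offset, length) pieces."""
    pieces = []
    end = offset + length
    for seg in sorted(segments):  # sorts by logical_start first
        seg_end = seg.logical_start + seg.length
        lo, hi = max(offset, seg.logical_start), min(end, seg_end)
        if lo < hi:
            pieces.append((seg.physical_start + lo - seg.logical_start, hi - lo))
    # Refuse to touch the disk if any part of the range is unmapped.
    if sum(n for _, n in pieces) != length:
        raise ValueError("range is not fully covered by the segment list")
    return pieces
```

A client holding only this list never has to read LVM metadata. The check at the end is the client-side answer to the worry about access going outside the LV's allocation, provided the list itself is current.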

> For instance, when a CoW snapshot is
> written to and expanded, the metadata must be written to, and there
> is no locking for that.

Right, but that was not part of your use-case. If you need such things,
you can't use ext4 either.

> 2. Snapshots suffer severe limitations. For instance,
> it is not possible to generate arbitrarily deep trees of snapshots
> (i.e. CoW on top of CoW) without an arbitrarily deep tree of loopback
> mounted lvm devices, which does not sound like a good idea.
>
> I think you can only use lvm like this where you have simple volumes
> mounted, and in essence take no snapshots.

Yes, and I mentioned lvm because that was exactly your use-case.

> To answer the implied question, yes we have a (partial) lvm replacement.

---> Did you consider using plain LVM for this purpose? <---

That was an explicit question

>>> GFS and OCFS both handle shared writers for the same SAN disk (AFAIK),
>> They are SUPPOSED to do that - in theory
> We have had similar experiences and don't actually need all the features
> (and thus complexity) that a true clustered filing system presents.

Ok, so not my fault

Regards, Bodo

 
11-05-2010, 07:08 AM
Alex Bligh

How to generate a large file allocating space

--On 5 November 2010 00:05:45 +0100 Bodo Thiesen <bothie@gmx.de> wrote:


>> For instance, when a CoW snapshot is
>> written to and expanded, the metadata must be written to, and there
>> is no locking for that.
>
> Right, but that was not part of your use-case. If you need such things,
> you can't use ext4 either.


I should have been clearer. We aren't using ext4 as anything other than
a block store. The CoW snapshots are done using our LVM replacement
type thing which stores metadata in such a way that it is safe to access
it from multiple readers/writers. It would be lovely to use LVM for
this, but not (as far as I can tell) possible.

I might have another look at using lvm as a blockstore, then running our
stuff inside lvm. But I didn't think lvm was capable of running thousands
of LVs per volume group. ext4 is just fine for that. Perhaps I am
slating lvm unfairly.

--
Alex Bligh

 
11-05-2010, 07:14 AM
Alex Bligh

How to generate a large file allocating space

Ted,

--On 4 November 2010 15:17:34 -0400 Ted Ts'o <tytso@mit.edu> wrote:


> Patches to do this wouldn't be that hard. The harder part would
> probably be the politics on fs-devel regarding the semantics of
> FALLOC_FL_EXPOSE_OLD_DATA.


Also presumably there would be some pressure to make it work for
every filesystem that supported fallocate().


>> In the meantime, particularly as I'd ideally like to avoid a kernel
>> modification, is there a safe way I could use or modify the ext2
>> library to run through the extents of a fallocated() file and clear
>> the "unwritten" bit? If I clear that (which from memory is the top
>> bit of the extent length), is that alone safe? (on an unmounted
>> file system, obviously).
>
> Yes, there most certainly is. The functions you'd probably want to
> use are ext2fs_extent_open(), and then either ext2fs_extent_goto()
> to go to a specific extent, or ext2fs_extent_get() with the
> EXT2_EXTENT_NEXT operation to iterate over the extents, and then
> ext2fs_extent_replace() to mutate the extent. Oh, and then use
> ext2fs_extent_close() when you're done looking at and/or changing the
> extents of a file.
>
> If you build tst_extents in lib/ext2fs, you can use commands like
> "inode" (to open the extents for a particular inode), and "root",
> "current", "next", "prev", "next_leaf", "prev_leaf", "next_sibling",
> "prev_sibling", "delete_node", "insert_node", "replace_node",
> "split_node", "print_all", "goto", etc. Please don't use this in
> production, but it's not a bad way to play with an extent tree, either
> for learning purposes or to create test cases. tst_extents.c is also
> a good way of seeing how the various libext2fs extent APIs work.


Thanks, that's really helpful. Are the extents always the leaves? IE
will next_leaf take me through extent by extent?

Does your "please don't use this in production" warning apply to
tst_extents.c or to the whole of lib/ext2fs? The library calls
seem quite a good way to get the list of extents and are
presumably what fsck etc. use.


> Well, things like Hadoopfs exist today, and Ceph (if you need a
> POSIX-level access)

No, just block layer access fortunately

> is admittedly less stable. But if you're starting
> from scratch, wouldn't that be pretty far away from readiness as well?


The idea was to base as much as possible on existing running code (e.g.
ext4) with as few variations as possible. I'd be very surprised if we
end up exceeding a few thousand lines of code. All the cluster, lock
management etc we are borrowing from elsewhere, for instance.

--
Alex Bligh

 
11-05-2010, 10:32 AM
Bodo Thiesen

How to generate a large file allocating space

* Alex Bligh <alex@alex.org.uk> wrote:
> I might have another look at using lvm as a blockstore, then running our
> stuff inside lvm. But I didn't think lvm was capable of running thousands
> of LVs per volume group. ext4 is just fine for that. Perhaps I am
> slating lvm unfairly.

The number of logical volumes you can create should depend mostly on
the size of the metadata area. A quick look at man pvcreate revealed the
command line argument --metadatasize size. Besides this, lvm should be
able to handle an arbitrary number of logical volumes as long as the
metadata area is big enough to hold the new config. (The same applies to
ext2 and ext3 - if you don't have inodes left, you can't create new files
even with thousands of free terabytes - I don't know if this limitation
still exists in ext4, I'd guess "yes".)

So, my tip would be to just create a pv with a very big metadata
size (e.g. 512 MB or even bigger) and write a script to create a few
thousand lvs on that pv, something like this:

# reserve a large metadata area on the physical volume
pvcreate --metadatasize 512M /dev/foobar
# the volume group has to exist before any logical volume can be created
vgcreate foobars /dev/foobar
for i in $(seq 1 1 5000)
do
    lvcreate --size 256M -n foobar$i foobars
done

Either it works - or not ...

Regards, Bodo

 
11-06-2010, 03:30 PM
Ted Ts'o

How to generate a large file allocating space

On Fri, Nov 05, 2010 at 08:14:56AM +0000, Alex Bligh wrote:
>
> >Patches to do this wouldn't be that hard. The harder part would
> >probably be the politics on fs-devel regarding the semantics of
> >FALLOC_FL_EXPOSE_OLD_DATA.
>
> Also presumably there would be some pressure to make it work for
> every filesystem that supported fallocate().

No, I don't think so. There are plenty of file systems that don't
support fallocate(), and it's not a short step to consider adding new
flags which might not be supported by all.

> Thanks, that's really helpful. Are the extents always the leaves? IE
> will next_leaf take me through extent by extent?

Yes, to both questions.
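
As a toy illustration of why that holds: interior nodes of the tree carry only pointers to lower nodes, so visiting the leaves left to right yields every extent in logical order. The Python model below is in-memory only and has nothing to do with the libext2fs data structures; Extent, IndexNode, LeafNode and iter_extents() are hypothetical names.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    logical: int    # first logical block covered
    length: int     # number of blocks
    physical: int   # first physical block


class IndexNode(NamedTuple):
    children: list  # interior node: points at lower nodes only


class LeafNode(NamedTuple):
    extents: list   # depth-0 node: the only place extents live


def iter_extents(node):
    """Yield every extent in logical order by visiting leaves left to right."""
    if isinstance(node, LeafNode):
        yield from node.extents
    else:
        for child in node.children:
            yield from iter_extents(child)
```

Stepping with next_leaf corresponds to advancing this generator by one extent.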

> Does your "please don't use this in production" warning apply to
> tst_extents.c or to the whole of lib/ext2fs? The library calls
> seem quite a good way to get the list of extents and are
> presumably what fsck etc. use.

No, only to tst_extents.c. It has a tst_ prefix precisely because
it's a little hacky, and it was something that I had never intended to
be installed by distributions. (I got a little burned by "filefrag",
which was never intended to be installed by distributions, which is why
the code is so hackish, and why it's not internationalized, etc.) I
just want to make sure tst_extents doesn't similarly escape.

The libext2fs is designed to be a production-quality codebase, with a
stable ABI. So feel free to use it in good health. :-)

- Ted

 
11-06-2010, 06:44 PM
Alex Bligh

How to generate a large file allocating space

--On 6 November 2010 12:30:21 -0400 Ted Ts'o <tytso@mit.edu> wrote:


> On Fri, Nov 05, 2010 at 08:14:56AM +0000, Alex Bligh wrote:
>>
>>> Patches to do this wouldn't be that hard. The harder part would
>>> probably be the politics on fs-devel regarding the semantics of
>>> FALLOC_FL_EXPOSE_OLD_DATA.
>>
>> Also presumably there would be some pressure to make it work for
>> every filesystem that supported fallocate().
>
> No, I don't think so. There are plenty of file systems that don't
> support fallocate(), and it's not a short step to consider adding new
> flags which might not be supported by all.


Thanks. I might have a go. Patches to linux-ext4@ ?


>> Thanks, that's really helpful. Are the extents always the leaves? IE
>> will next_leaf take me through extent by extent?
>
> Yes, to both questions.
>
>> Does your "please don't use this in production" warning apply to
>> tst_extents.c or to the whole of lib/ext2fs? The library calls
>> seem quite a good way to get the list of extents and are
>> presumably what fsck etc. use.
>
> No, only to tst_extents.c.
>
> ...
>
> The libext2fs is designed to be a production-quality codebase, with a
> stable ABI. So feel free to use it in good health. :-)


Again, thanks for that.

--
Alex Bligh
