11-01-2010, 09:58 PM
Alex Bligh

How to generate a large file allocating space

--On 1 November 2010 15:45:12 -0600 Andreas Dilger
<adilger.kernel@dilger.ca> wrote:



> What is it you really want to do in the end? Shared concurrent writers
> to the same file? High-bandwidth IO to the underlying disk?


High bandwidth I/O to the underlying disk is part of it - only one
reader/writer per file. We're really using ext4 just for its extents
capability, i.e. allocating space, plus the convenience of directory
lookup to find the set of extents.

It's easier to do this than to write this bit from scratch, and the
files are pretty static in size (i.e. they only grow, and grow
infrequently by large amounts). The files on ext4 correspond to large
chunks of disks we are combining together using a device-mapper
type thing (but different), and on top of that live arbitrary real
filing systems. Because our device-mapper type thing already
understands which blocks have been written to, we already have a layer
that prevents the data that was on the disk before the file's creation
from being exposed. That's why I don't need ext4 to zero them out. I
suppose in that sense it is like the swap file case.

Oh, and because these files are allocated infrequently, I am not
/that/ concerned about performance (famous last words). The performance
critical stuff is done via direct writes to the SAN and doesn't even
pass through ext4 (or indeed through any single host).
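
(Illustrative aside, not part of the original mail: a minimal sketch of
this kind of preallocation and extent lookup from the shell, assuming a
reasonably recent util-linux and e2fsprogs; the mount point, file name
and size are made up.)

  # allocate 100GiB of extents up front without writing any data
  fallocate -l 100G /mnt/ext4/chunk0001
  # list the file's extents (logical offset -> physical blocks, plus flags)
  filefrag -v /mnt/ext4/chunk0001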

--
Alex Bligh

_______________________________________________
Ext3-users mailing list
Ext3-users@redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users
 
11-02-2010, 12:49 AM
"Ted Ts'o"

How to generate a large file allocating space

On Mon, Nov 01, 2010 at 10:58:12PM +0000, Alex Bligh wrote:
> High bandwidth I/O to the underlying disk is part of it - only one
> reader/writer per file. We're really using ext4 just for its extents
> capability, i.e. allocating space, plus the convenience of directory
> lookup to find the set of extents.
>
> It's easier to do this than to write this bit from scratch, and the
> files are pretty static in size (i.e. they only grow, and grow
> infrequently by large amounts). The files on ext4 correspond to large
> chunks of disks we are combining together using a device-mapper
> type thing (but different), and on top of that live arbitrary real
> filing systems. Because our device-mapper type thing already
> understands which blocks have been written to, we already have a layer
> that prevents the data that was on the disk before the file's creation
> from being exposed. That's why I don't need ext4 to zero them out. I
> suppose in that sense it is like the swap file case.

But why not just use O_DIRECT? Do you really need to access the
disk directly, as opposed to using O_DIRECT?
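
(Illustrative aside, not part of the original mail: O_DIRECT can be
exercised through the mounted filesystem from the shell as well; the
path here is made up, and dd's oflag=direct simply requests O_DIRECT
for the writes while conv=notrunc,nocreat keeps the preallocated file
intact.)

  dd if=/dev/zero of=/mnt/ext4/chunk0001 bs=1M count=1024 \
     oflag=direct conv=notrunc,nocreat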

- Ted

 
11-02-2010, 02:21 AM
Andreas Dilger

How to generate a large file allocating space

On 2010-11-01, at 16:58, Alex Bligh wrote:
> --On 1 Nov 2010 15:45:12 Andreas Dilger <adilger.kernel@dilger.ca> wrote:
>> What is it you really want to do in the end? Shared concurrent writers
>> to the same file? High-bandwidth IO to the underlying disk?
>
> High bandwidth I/O to the underlying disk is part of it - only one
> reader/writer per file. We're really using ext4 just for its extents
> capability, i.e. allocating space, plus the convenience of directory
> lookup to find the set of extents.
>
> It's easier to do this than to write this bit from scratch, and the
> files are pretty static in size (i.e. they only grow, and grow
> infrequently by large amounts). The files on ext4 correspond to large
> chunks of disks we are combining together using a device-mapper
> type thing (but different), and on top of that live arbitrary real
> filing systems. Because our device-mapper type thing already
> understands which blocks have been written to, we already have a layer
> that prevents the data that was on the disk before the file's creation
> from being exposed. That's why I don't need ext4 to zero them out. I
> suppose in that sense it is like the swap file case.
>
> Oh, and because these files are allocated infrequently, I am not
> /that/ concerned about performance (famous last words). The performance
> critical stuff is done via direct writes to the SAN and doesn't even
> pass through ext4 (or indeed through any single host).

Actually, I think Ceph has a network block-device feature (recently submitted/committed to mainline), and Lustre has a prototype block-device feature as well.

Cheers, Andreas






 
11-02-2010, 06:58 AM
Alex Bligh

How to generate a large file allocating space

Ted,

On 2 Nov 2010, at 01:49, "Ted Ts'o" <tytso@mit.edu> wrote:

> But why not just use O_DIRECT? Do you really need to access the
> disk directly, as opposed to using O_DIRECT?
>
Because more than one machine will be accessing the data on the ext4 volume (over iSCSI), though access to the large files is mediated by locks higher up. To use O_DIRECT each accessing machine would need to have the volume mounted, rather than merely receiving a list of extents.
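
(Illustrative aside, not part of the original mail: one way the host
that owns the ext4 volume might extract such an extent list without
the other machines mounting anything; the device and file names are
made up. debugfs reads the unmounted block device directly.)

  debugfs -R "dump_extents /chunk0001" /dev/mapper/big-ext4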

--
Alex Bligh

 
11-02-2010, 07:01 AM
Alex Bligh

How to generate a large file allocating space

On 2 Nov 2010, at 03:21, Andreas Dilger <adilger.kernel@dilger.ca> wrote:

>
> Actually, I think Ceph has a network block-device feature (recently submitted/committed to mainline), and Lustre has a prototype block-device feature as well.

Unfortunately I need something that is not a prototype. Fortunately I don't need many of Lustre's or Ceph's features.

--
Alex Bligh


 
11-02-2010, 10:20 AM
Ric Wheeler

How to generate a large file allocating space

On 11/02/2010 04:01 AM, Alex Bligh wrote:
> On 2 Nov 2010, at 03:21, Andreas Dilger <adilger.kernel@dilger.ca> wrote:
>> Actually, I think Ceph has a network block-device feature (recently
>> submitted/committed to mainline), and Lustre has a prototype
>> block-device feature as well.
>
> Unfortunately I need something that is not a prototype. Fortunately I
> don't need many of Lustre's or Ceph's features.

Sounds like you will end up writing something brand new - much less stable
than any of the options mentioned previously in the thread.


Ric

 
11-02-2010, 04:37 PM
Alex Bligh

How to generate a large file allocating space

--On 2 November 2010 07:20:48 -0400 Ric Wheeler <rwheeler@redhat.com> wrote:


> Sounds like you will end up writing something brand new - much less
> stable than any of the options mentioned previously in the thread.

Well, the new component will be something simple. All I really
need to know is how to mark the extents as allocated and initialised,
rather than unwritten.

--
Alex Bligh

 
11-04-2010, 11:46 AM
Bodo Thiesen

How to generate a large file allocating space

Hello Alex, hello Andreas

* Andreas Dilger <adilger@dilger.ca> wrote:
> On 2010-10-31, at 09:05, Alex Bligh wrote:
>> I am trying to allocate huge files on ext4. I will then read the extents
>> within the file and write to the disk at a block level rather than using
>> ext4 (the FS will not be mounted at this point). This will allow me to
>> have several iSCSI clients hitting the same LUN r/w safely. And at
>> some point when I know the relevant iSCSI stuff has stopped and been
>> flushed to disk, I may unlink the file.

Question: Did you consider using plain LVM for this purpose? By creating a
logical volume, no data is initialized, only the metadata is created
(which seems to be exactly what you need). Then, each client may access one
logical volume r/w. Retrieving the extent list is very easy as well. And
because there is no group management data (cluster bitmaps, inode bitmaps
and tables) of any kind, you will end up with only a single extent in
most cases, regardless of the size of the volume you've created.
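
(Illustrative aside, not part of the original mail: a minimal sketch of
the LVM approach described above; the volume group and LV names are
made up.)

  # creates only metadata; no data blocks are initialised
  lvcreate -L 100G -n chunk0001 vg0
  # the mapping, as 512-byte sector ranges: "start length linear <device> <offset>"
  dmsetup table vg0-chunk0001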

> Hmm, why not simply use a cluster filesystem to do this?
>
> GFS and OCFS both handle shared writers for the same SAN disk (AFAIK),

They are SUPPOSED to do that - in theory. Over the last two weekends I tried
to set up a stable DRBD+GFS2 setup - I failed. Then I tried OCFS2 - again I
failed. The setup was simple: two identical systems with 10*500GB disks
and a hardware RAID6 yielding 4TB of user disk space. That was used to create
a DRBD device (no LVM or other layers like crypto in between). Both were set
to primary, and then I created GFS2 (later OCFS2) and started the additional
tools like clvm/o2cb. Then I mounted the file systems on both machines -
everything worked up to this point.

machine1: dd if=/dev/zero of=/mnt/4tb/file1
machine2: dd if=/dev/zero of=/mnt/4tb/file2

Worked well in both setups on both machines

machine1: let i=0; while let i=i+1; do echo "A$i" >> /mnt/4tb/file3; done
machine2: let i=0; while let i=i+1; do echo "B$i" >> /mnt/4tb/file3; done

GFS2: The first machine worked well; the second machine started returning EIO
on *ANY* request (even ls /mnt/4tb). Unmounting was impossible; I had to
reboot -> #gfs2 #fail
OCFS2: passed this test as well as the next one

machine1: let i=0; while let i=i+1; do echo "A$i"; done >> /mnt/4tb/file4
machine2: let i=0; while let i=i+1; do echo "B$i"; done >> /mnt/4tb/file4

Then I rebooted one machine with "echo b > /proc/sysrq-trigger" while the
last test was still in progress. Guess what: the other machine stopped
working. No reads, no writes. It didn't even recover when the first machine
came back. I then had to reboot the second one as well to continue using
the file system.

Maybe I did something wrong, or maybe the file systems just aren't as stable
as we expected them to be. In any case, we have now decided to use stable
components, i.e. DRBD in a primary/secondary setup and ext3 with failover to
the other system if the primary goes down. As the system has already gone
into production, we're not going to change anything there in the near
future, so consider this report strictly informative.

BTW: No, I no longer have the config files. I didn't save them, and the
systems were completely reinstalled after testing of the final setup
succeeded, to wipe out everything left over from the previous attempts.

Regards, Bodo

 
11-04-2010, 03:16 PM
"Ted Ts'o"

How to generate a large file allocating space

On Tue, Nov 02, 2010 at 07:58:02AM +0000, Alex Bligh wrote:
> On 2 Nov 2010, at 01:49, "Ted Ts'o" <tytso@mit.edu> wrote:
> > But why not just use O_DIRECT? Do you really need to access the
> > disk directly, as opposed to using O_DIRECT?
> >
> Because more than one machine will be accessing the data on the ext4
> volume (over iSCSI), though access to the large files is mediated by
> locks higher up. To use O_DIRECT each accessing machine would need
> to have the volume mounted, rather than merely receiving a list of
> extents.

Well, I would personally not be against an extension to fallocate()
where, if the caller of the syscall specifies a new flag (which might
be named FALLOC_FL_EXPOSE_OLD_DATA) and either has root privs or (if
capabilities are enabled) CAP_DAC_OVERRIDE && CAP_MAC_OVERRIDE, it
would be able to allocate blocks whose extents are marked as
initialized without actually initializing the blocks first.
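
(Illustrative aside, not part of the original mail: FALLOC_FL_EXPOSE_OLD_DATA
is only a proposal in this message, not an existing flag. Today a
preallocated file's extents are flagged "unwritten", which is precisely what
the proposed flag would avoid; the file name below is made up.)

  fallocate -l 10G /mnt/ext4/testfile
  # the flags column reports "unwritten": reads return zeroes, not stale disk data
  filefrag -v /mnt/ext4/testfile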

I don't know whether it will get past the fs-devel bike shed painting
crew, but I do have some cluster file system users who would like
something similar. In their case they will be writing the files using
Direct I/O, and the objects are all checksummed at the cluster file
system level; if an object has the wrong checksum, the cluster file
system will ask another server for it. Since the cluster file system
is considered trusted, and it verifies the expected object checksum
before releasing the data, there is no security issue.

You do realize, though, that it sounds like with your design you are
replicating the servers, but not the disk devices --- so if your disk
device explodes, you're Sadly Out of Luck. Sure you can use
super-expensive storage arrays, but if you're writing your own cluster
file system, why not create a design which uses commodity disks and
worry about replicating data across servers at the cluster file system
level?

- Ted

 
11-04-2010, 05:22 PM
Alex Bligh

How to generate a large file allocating space

--On 4 November 2010 13:46:38 +0100 Bodo Thiesen <bothie@gmx.de> wrote:


> * Andreas Dilger <adilger@dilger.ca> wrote:
>> On 2010-10-31, at 09:05, Alex Bligh wrote:
>>> I am trying to allocate huge files on ext4. I will then read the extents
>>> within the file and write to the disk at a block level rather than using
>>> ext4 (the FS will not be mounted at this point). This will allow me to
>>> have several iSCSI clients hitting the same LUN r/w safely. And at
>>> some point when I know the relevant iSCSI stuff has stopped and been
>>> flushed to disk, I may unlink the file.
>
> Question: Did you consider using plain LVM for this purpose? By creating a
> logical volume, no data is initialized, only the metadata is created
> (which seems to be exactly what you need). Then, each client may access one
> logical volume r/w. Retrieving the extent list is very easy as well. And
> because there is no group management data (cluster bitmaps, inode bitmaps
> and tables) of any kind, you will end up with only a single extent in
> most cases, regardless of the size of the volume you've created.


Plain LVM or Clustered LVM? Clustered LVM has some severe limitations,
including needing to restart the entire cluster to add nodes, which
is not acceptable.

Plain LVM has two types of issue:

1. Without clustered LVM, as far as I can tell there is no locking
of metadata. I have no guarantees that access to the disk does not
go outside the LV's allocation. For instance, when a CoW snapshot is
written to and expanded, the metadata must be written to, and there
is no locking for that.

2. Snapshots suffer severe limitations. For instance,
it is not possible to generate arbitrarily deep trees of snapshots
(i.e. CoW on top of CoW) without an arbitrarily deep tree of
loopback-mounted LVM devices, which does not sound like a good idea.

I think you can only use LVM like this where you have simple volumes
mounted and, in essence, take no snapshots.

To answer the implied question, yes we have a (partial) lvm replacement.


>> GFS and OCFS both handle shared writers for the same SAN disk (AFAIK),
>
> They are SUPPOSED to do that - in theory.


We have had similar experiences and don't actually need all the features
(and thus complexity) that a true clustered filing system presents.

--
Alex Bligh

 
