03-26-2011, 10:53 PM
"Ted Ts'o"

Ext3: Why data=journal is better than data=ordered when data needs to be read from and written to disk at the same time

On Sat, Mar 26, 2011 at 07:20:08PM -0400, Jidong Xiao wrote:
> Hi,
>
> I have seen many sources mention this, but I have never seen anyone
> explain it in detail. (Although this link gives the original story:
> http://lkml.indiana.edu/hypermail//linux/kernel/0107.1/0364.html)
>
> "Journal mode: This mode is the slowest except when data needs to be
> read from and written to disk at the same time where it outperforms all
> other modes."

I didn't see any reference to that in that mail thread (which seemed
to be mostly about reiserfs). It is true that if you have a bursty,
fsync-heavy workload, you can reduce latency by using data=journal
mode, because it avoids seeks --- the data and metadata blocks are
written into the journal, and this allows the fsync() to finish more
quickly. There are some applications where this might be useful, such
as NFS file serving, where the NFS server is not allowed to send an
acknowledgement back to the client until the data is written to stable
store.

- Ted

_______________________________________________
Ext3-users mailing list
Ext3-users@redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users
 
03-26-2011, 11:25 PM
Jidong Xiao

On Sat, Mar 26, 2011 at 7:53 PM, Ted Ts'o <tytso@mit.edu> wrote:
> On Sat, Mar 26, 2011 at 07:20:08PM -0400, Jidong Xiao wrote:
>> Hi,
>>
>> I have seen many sources mention this, but I have never seen anyone
>> explain it in detail. (Although this link gives the original story:
>> http://lkml.indiana.edu/hypermail//linux/kernel/0107.1/0364.html)
>>
>> "Journal mode: This mode is the slowest except when data needs to be
>> read from and written to disk at the same time where it outperforms all
>> other modes."
>
> I didn't see any reference to that in that mail thread (which seemed
> to be mostly about reiserfs). It is true that if you have a bursty,
> fsync-heavy workload, you can reduce latency by using data=journal
> mode, because it avoids seeks --- the data and metadata blocks are
> written into the journal, and this allows the fsync() to finish more
> quickly. There are some applications where this might be useful, such
> as NFS file serving, where the NFS server is not allowed to send an
> acknowledgement back to the client until the data is written to stable
> store.
>
> - Ted
>

Well, the first time Andrew Morton claimed that data=journal is
better than data=ordered under certain conditions was when he announced
the release of ext3-2.4-0.9.4:

http://www.redhat.com/archives/ext3-users/2001-July/msg00169.html

And the link I provided in the original email is actually the source or
background of this story. This release came immediately after that
discussion.

But my question is why data=journal could outperform data=ordered.
In data=journal mode, you have to write both the data and metadata
blocks into the journal, but in data=ordered mode, you only have
to write the metadata blocks into the journal. If, in certain
cases, the former mode can avoid seeks, then the same should
apply to the latter mode. So it's really odd that the former mode can
outperform the latter.

Regards
Jidong

 
03-27-2011, 01:44 AM
"Ted Ts'o"

On Sat, Mar 26, 2011 at 08:25:23PM -0400, Jidong Xiao wrote:
>
> But my question is why data=journal could outperform data=ordered.
> In data=journal mode, you have to write both the data and metadata
> blocks into the journal, but in data=ordered mode, you only have
> to write the metadata blocks into the journal. If, in certain
> cases, the former mode can avoid seeks, then the same should
> apply to the latter mode. So it's really odd that the former mode can
> outperform the latter.

When executing an fsync(), in data=ordered mode you have to write the
data blocks to their final locations on disk and wait for those writes
to complete. This will generally require extra seeks. In
data=journal mode, the data blocks can be written directly into the
journal without needing to seek.

Of course eventually the data and metadata blocks will need to be
written to their permanent locations before the journal space can be
reused. But for short bursty write patterns, the fsync() latency will
be much smaller in data=journal mode.
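
One way to observe this is to time a short burst of small writes, each
followed by an fsync(), on a filesystem mounted with each mode in turn.
What follows is only a minimal sketch, assuming GNU dd (for conv=fsync)
and GNU date (for %N). TARGET_DIR, BURST and the file name are
hypothetical, and the figure it prints depends heavily on the hardware.

```shell
#!/bin/sh
# Sketch only: time a burst of small writes, each followed by fsync().
# Assumes GNU dd (conv=fsync) and GNU date (%N = nanoseconds).
# TARGET_DIR, BURST and the file name are hypothetical; point TARGET_DIR
# at a directory on the filesystem under test.
TARGET_DIR="${TARGET_DIR:-.}"
BURST="${BURST:-20}"
out="$TARGET_DIR/fsync-burst.$$"

start=$(date +%s%N)
i=0
while [ "$i" -lt "$BURST" ]; do
    # conv=fsync makes dd fsync() the output file before it exits;
    # notrunc plus seek= writes block i without truncating the file.
    dd if=/dev/zero of="$out" bs=4096 count=1 seek="$i" conv=fsync,notrunc 2>/dev/null
    i=$((i + 1))
done
end=$(date +%s%N)

size=$(wc -c < "$out")
rm -f "$out"
elapsed_ms=$(( (end - start) / 1000000 ))
echo "$BURST fsync'd 4 KiB writes ($size bytes) took $elapsed_ms ms"
```

Running it once per mount mode, on an otherwise idle disk, gives two
numbers to compare.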

Regards,

- Ted

 
03-27-2011, 04:52 AM
Jidong Xiao

On Sat, Mar 26, 2011 at 10:44 PM, Ted Ts'o <tytso@mit.edu> wrote:
> On Sat, Mar 26, 2011 at 08:25:23PM -0400, Jidong Xiao wrote:
>>
>> But my question is why data=journal could outperform data=ordered.
>> In data=journal mode, you have to write both the data and metadata
>> blocks into the journal, but in data=ordered mode, you only have
>> to write the metadata blocks into the journal. If, in certain
>> cases, the former mode can avoid seeks, then the same should
>> apply to the latter mode. So it's really odd that the former mode can
>> outperform the latter.
>
> When executing an fsync(), in data=ordered mode you have to write the
> data blocks to their final locations on disk and wait for those writes
> to complete. This will generally require extra seeks. In
> data=journal mode, the data blocks can be written directly into the
> journal without needing to seek.
>
> Of course eventually the data and metadata blocks will need to be
> written to their permanent locations before the journal space can be
> reused. But for short bursty write patterns, the fsync() latency will
> be much smaller in data=journal mode.
>

Thank you Ted, it is really helpful!

So the difference is:
data=ordered mode: fsync() returns only once the metadata blocks
have been written into the journal and the data blocks have been
written to their final locations on disk.
data=journal mode: fsync() returns once the metadata and data have been
written into the journal. The journal is contiguous, so data=journal
mode needs no seeking, and therefore fsync() returns more
quickly.
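
When comparing the two modes it helps to confirm which one a filesystem
is actually mounted with. A minimal sketch, assuming the mode shows up
as a data= entry in the options field of /proc/mounts-style input; the
function name ext3_data_modes is made up for illustration:

```shell
# Sketch only: print "<mount point> <data mode>" for every ext3 entry in
# /proc/mounts-format input read from stdin. ext3_data_modes is a
# hypothetical helper name, not an existing tool.
ext3_data_modes() {
    awk '$3 == "ext3" {
        mode = "(not shown)"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^data=/)
                mode = substr(opts[i], 6)   # text after "data="
        print $2, mode
    }'
}

# Typical use on a Linux system:
#   ext3_data_modes < /proc/mounts
```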

If we read from and write to the disk simultaneously, as in the
following example:

First, write data to the filesystem as quickly as possible:

Rapid writing

while true
do
dd if=/dev/zero of=largefile bs=16384 count=131072
done

While data is being written to the test filesystem, read 16 MB of data
from the same filesystem on the same disk, timing the results:

Reading a 16 MB file

time cat 16-meg-file > /dev/null

In this case, if we conduct the experiment in data=journal mode and
data=ordered mode respectively, since write latency is much smaller in
data=journal mode, the disk will focus more on the read operation,
hence the read operation will also finish earlier than it does in
data=ordered mode. Am I understanding correctly?
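
The two steps above can be combined into one script, roughly as
follows. This is only a sketch: TARGET_DIR, WRITE_BLOCKS, READ_MIB and
the file names are hypothetical, and the writer passes are much smaller
than in the original experiment so that the script finishes quickly.
Unless the page cache is dropped first, the timed read may be served
from RAM rather than from the disk.

```shell
#!/bin/sh
# Sketch only: time a 16 MiB read while a background writer keeps the
# disk busy. TARGET_DIR, WRITE_BLOCKS, READ_MIB and the file names are
# hypothetical. Unless the page cache is dropped first (as root:
# echo 3 > /proc/sys/vm/drop_caches), the read may never touch the disk.
TARGET_DIR="${TARGET_DIR:-.}"
WRITE_BLOCKS="${WRITE_BLOCKS:-1024}"   # 16 KiB blocks per writer pass
READ_MIB="${READ_MIB:-16}"
readfile="$TARGET_DIR/read-test.$$"
writefile="$TARGET_DIR/write-test.$$"
stopfile="$TARGET_DIR/stop-writer.$$"

# Create the file to be read back and push it out to disk.
dd if=/dev/zero of="$readfile" bs=1048576 count="$READ_MIB" 2>/dev/null
sync

# Background writer: rewrite a file until the stop file appears.
(
    while [ ! -e "$stopfile" ]; do
        dd if=/dev/zero of="$writefile" bs=16384 count="$WRITE_BLOCKS" 2>/dev/null
    done
) &
writer=$!

sleep 1                                 # let the writer get going
start=$(date +%s)
bytes_read=$(cat "$readfile" | wc -c)   # cat forces the data to be read
end=$(date +%s)

# Stop the writer cleanly, then remove the scratch files.
: > "$stopfile"
wait "$writer"
rm -f "$readfile" "$writefile" "$stopfile"
echo "read $bytes_read bytes in $((end - start)) s under write load"
```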

Regards
Jidong

 
03-28-2011, 04:43 PM
Peter Grandi

[ ... ]

>> When executing an fsync(), in data=ordered mode you have to
>> write the data blocks to their final locations on disk and
>> wait for those writes to complete. This will generally
>> require extra seeks. In data=journal mode, the data blocks
>> can be written directly into the journal without needing
>> to seek.

>> Of course eventually the data and metadata blocks will need
>> to be written to their permanent locations before the journal
>> space can be reused. But for short bursty write patterns,
>> the fsync() latency will be much smaller in data=journal
>> mode.

> [ ... ]

> In this case, if we conduct the experiment in data=journal
> mode and data=ordered mode respectively,

That experiment is not necessarily demonstrative; it depends on
RAM caching, the elevator, ...

> since write latency is much smaller in data=journal mode,

Write latency is actually much longer, because it requires *two*
writes instead of one. It is *fsync* latency as mentioned above
that is smaller, because it depends only on the first write to
what is in effect a small log-based filesystem. This distinction
matters a great deal, because it is the reason why "short bursty
write patterns" is the qualification above. For long write
patterns things are very different as the journal eventually
fills up. For any given size it will also fill up a lot faster
for 'data=journal'.
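
A rough back-of-the-envelope sketch of that last point. Every number
here is a made-up assumption (a 128 MiB journal, 50 MiB/s of incoming
data, one metadata block per 64 data blocks), and checkpointing, which
frees journal space in the background, is ignored:

```shell
#!/bin/sh
# Illustrative arithmetic only; all figures below are assumptions.
journal_kib=$((128 * 1024))               # assumed journal size: 128 MiB
data_kib_per_s=$((50 * 1024))             # assumed incoming data rate: 50 MiB/s
meta_kib_per_s=$((data_kib_per_s / 64))   # assumed 1 metadata block per 64 data blocks

# data=ordered: only metadata passes through the journal.
ordered_fill_s=$((journal_kib / meta_kib_per_s))
# data=journal: data and metadata both pass through the journal.
journal_fill_s=$((journal_kib / (data_kib_per_s + meta_kib_per_s)))

echo "data=ordered: about $ordered_fill_s s of writing to log one journal's worth"
echo "data=journal: about $journal_fill_s s of writing to log one journal's worth"
```

With these assumed figures the gap is roughly two orders of magnitude,
which is why the "short bursty" qualification matters.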

Ahhh while writing that I have just realized that large journals
can be a bad idea especially for metadata operations. Will have
to think more about that.

> the disk will focus more on the read operation, hence the
> read operation will also finish earlier than it does in
> data=ordered mode. Am I understanding correctly?

That again depends on a lot of things, including caching, the
elevator, flusher behaviour, exactly where the files are...

Also, whether the journal is on the same drive as the filesystem
or another drive can matter enormously; also whether for example
the journal is on SSD or battery backed RAM. There are reasons
why 'ext2' still quite outperforms 'ext3' on simple tests.

 
04-02-2011, 04:01 AM
Jidong Xiao

On Mon, Mar 28, 2011 at 12:43 PM, Peter Grandi
<pg_ext3@ext3.for.sabi.co.uk> wrote:
> [ ... ]
>
>>> When executing an fsync(), in data=ordered mode you have to
>>> write the data blocks to their final locations on disk and
>>> wait for those writes to complete. This will generally
>>> require extra seeks. In data=journal mode, the data blocks
>>> can be written directly into the journal without needing
>>> to seek.
>
>>> Of course eventually the data and metadata blocks will need
>>> to be written to their permanent locations before the journal
>>> space can be reused. But for short bursty write patterns,
>>> the fsync() latency will be much smaller in data=journal
>>> mode.
>
>> [ ... ]
>
>> In this case, if we conduct the experiment in data=journal
>> mode and data=ordered mode respectively,
>
> That experiment is not necessarily demonstrative; it depends on
> RAM caching, the elevator, ...
>
>> since write latency is much smaller in data=journal mode,
>
> Write latency is actually much longer, because it requires *two*
> writes instead of one. It is *fsync* latency as mentioned above
> that is smaller, because it depends only on the first write to
> what is in effect a small log-based filesystem. This distinction
> matters a great deal, because it is the reason why "short bursty
> write patterns" is the qualification above. For long write
> patterns things are very different as the journal eventually
> fills up. For any given size it will also fill up a lot faster
> for 'data=journal'.
>
> Ahhh while writing that I have just realized that large journals
> can be a bad idea especially for metadata operations. Will have
> to think more about that.
>
Well, the experiment I described was actually taken from the following article:

http://www.ibm.com/developerworks/library/l-fs8.html?S_TACT=105AGX52&S_CMP=cn-a-l

The author claims that it was Andrew Morton who tested this and showed that
"data=journal mode allowed the 16-meg-file to be read from 9 to over
13 times faster than other ext3 modes, ReiserFS, and even ext2 (which
has no journaling overhead)". Although I cannot find Andrew Morton's
original post on LKML, the fact is that this article has been widely
copied to many other websites.

Furthermore, the kernel's internal document,
Documentation/filesystems/ext3.txt, says:

* journal mode
data=journal mode provides full data and metadata journaling. All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all other modes.

Although Ted and you both explained that the fsync latency is shorter
in data=journal mode, my original question, as the title indicates, is
why data=journal outperforms the other modes when reading and writing
simultaneously. Or is this statement in the kernel doc not
accurate? If so, then we should submit a patch to fix the document
so that other people won't be misled, and it would be better to
show people some more demonstrative examples in which data=journal
really outperforms the other modes.

In addition, I am actually not very clear on why you said that write()
latency is longer while fsync() latency is shorter. I will try to
repeat what you said; please point out if I am incorrect:
1. Normally we call the write() syscall first and then call fsync() to
flush the data.
2. write() returns as soon as the data is written into the page cache,
while fsync() returns only once the data has been written to
stable storage.
3. Although write latency for data=journal mode is much longer
because it requires two writes instead of one, write() itself only
writes to the page cache, so its actual cost is not
high compared to the fsync() syscall, where we have to write to disk
and may require disk seeks. So we can mainly focus on the fsync()
system call.
4. Since the journal is stable storage, in data=journal mode
fsync() returns as soon as the metadata and the real data have been
written into the journal, and this process is sequential access.
But in data=ordered mode, fsync() returns only once the data
itself has been written to its final location on disk. Since this
process is random access, it needs many disk seeks, which are
expensive, so in this case fsync() latency is much longer than in
data=journal mode. And that's why we claim that data=journal wins for
this bursty write case.
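
Points 2 and 3 can be checked with a small timing sketch like the one
below, which compares a buffered write against the same write followed
by fsync(). It assumes GNU dd (for conv=fsync) and GNU date (for %N);
TARGET_DIR, the file name and the helper time_ms are hypothetical.

```shell
#!/bin/sh
# Sketch only: compare a buffered write with the same write plus fsync().
# Assumes GNU dd (conv=fsync) and GNU date (%N = nanoseconds).
# TARGET_DIR and the file name are hypothetical.
TARGET_DIR="${TARGET_DIR:-.}"
f="$TARGET_DIR/write-vs-fsync.$$"

# Run a command and print how long it took, in milliseconds.
time_ms() {
    t0=$(date +%s%N)
    "$@" 2>/dev/null
    t1=$(date +%s%N)
    echo $(( (t1 - t0) / 1000000 ))
}

# write() only: dd can return once the data is in the page cache.
buffered_ms=$(time_ms dd if=/dev/zero of="$f" bs=16384 count=64)
# write() + fsync(): dd returns only after fsync() on the file completes.
fsynced_ms=$(time_ms dd if=/dev/zero of="$f" bs=16384 count=64 conv=fsync)

rm -f "$f"
echo "buffered: $buffered_ms ms, with fsync: $fsynced_ms ms"
```

On a rotating disk the second figure would normally be the larger one,
and it is that figure the data= mount option is expected to affect.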

Are these correct?

Regards
Jidong
