09-09-2010, 05:24 PM
Matt Neimeyer

Pipe Lines - A really basic question

My generic question is: When I'm using a pipe line series of commands
do I use up more/less space than doing things in sequence?

For example, I have a development Gentoo VM that has a hard drive that
is too small... I wanted to move a database off of that onto another
machine but when I tried the following I filled my partition and 'evil
things' happened...

mysqldump blah...
gzip blah...

In this specific case I added another virtual drive, mounted that and
went on with life but I'm curious if I could have gotten away with the
pipe line instead. Will doing something like this still use "twice"
the space?

mysqldump | gzip > file.sql.gz

OR going back to my generic question if I pipe line like "type | sort
| uniq > output" does that only use 1x or 3x the disk space?

Thanks in advance!

Matt

P.S. If the answer is "it depends" how do I know what it depends on?
 
09-09-2010, 06:03 PM
Etaoin Shrdlu

On Thu, 9 Sep 2010 13:24:16 -0400 Matt Neimeyer <matt@neimeyer.org> wrote:

> My generic question is: When I'm using a pipe line series of commands
> do I use up more/less space than doing things in sequence?

Pipes live in memory and do not take any disk space. Doing the same
operations one after another without pipes usually needs temporary
files, which *do* take disk space.
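
A minimal sketch of the difference. The thread elides the real `mysqldump` arguments, so `make_dump` below is a hypothetical stand-in that just prints repetitive SQL-like text; the file names are made up as well.

```shell
#!/bin/sh
# make_dump is a hypothetical stand-in for mysqldump: it prints
# 10000 repetitive lines of SQL-like text to stdout.
make_dump() {
    i=0
    while [ "$i" -lt 10000 ]; do
        echo "INSERT INTO t VALUES ($i, 'some repetitive row data');"
        i=$((i + 1))
    done
}

workdir=$(mktemp -d)

# Sequential: the full uncompressed dump is written to disk first.
make_dump > "$workdir/seq.sql"
seq_peak=$(( $(wc -c < "$workdir/seq.sql") ))   # bytes on disk before gzip starts
# gzip writes seq.sql.gz next to seq.sql and removes seq.sql only when done,
# so for a while both files occupy disk space.
gzip "$workdir/seq.sql"

# Pipeline: only the compressed stream ever reaches the disk.
make_dump | gzip > "$workdir/pipe.sql.gz"
pipe_peak=$(( $(wc -c < "$workdir/pipe.sql.gz") ))

echo "sequential needs at least $seq_peak bytes, pipeline needs $pipe_peak bytes"
```

The exact numbers depend on how well the data compresses; the point is only that the pipeline never stores the uncompressed intermediate result.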
 
09-09-2010, 06:25 PM
Andrea Conti

> My generic question is: When I'm using a pipe line series of commands
> do I use up more/less space than doing things in sequence?

When you use a pipe you don't need the space to store intermediate
results between the two programs. The pipe is backed by a small
kernel-allocated RAM buffer (64 KiB by default on Linux since kernel
2.6.11, a single 4 KiB page before that), and the kernel blocks the
writer when the buffer is full and the reader when it is empty.

Not having to save intermediate results generally means that you will
need less disk space: this is especially true in the mysqldump-gzip
example as the uncompressed dump will not be written to the disk at any
stage.

Note however (this is the "it depends" part) that piping does not
affect whatever the programs might allocate or save internally: in your
second example (which does not involve any disk writing in either case)
"sort" needs to see the complete input before producing any output, so
it will allocate enough memory to store it whether it is invoked alone
or as part of a pipeline (in which case it will also stall the
downstream pipeline section until the upstream pipe is closed).
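
A small sketch of that stalling behaviour. The `producer` function and its sample words are made up for illustration; it reports on stderr after each line it writes, and since stderr bypasses the pipe, the progress messages appear before any of the sorted output.

```shell
#!/bin/sh
# Hypothetical producer: writes three words to stdout and announces
# each one on stderr, which does not go through the pipe.
producer() {
    for word in pear apple fig; do
        echo "$word"
        echo "wrote $word" >&2
    done
}

# 2>&1 on the whole group merges both streams so their order can be seen.
# sort cannot print anything until the producer has closed its end.
order=$( { producer | sort; } 2>&1 )
printf '%s\n' "$order"
```

All three "wrote ..." lines come first, followed by the sorted words, because `sort` only starts writing once it has seen end-of-file on its input.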

HTH,
andrea
 
09-09-2010, 07:09 PM
Florian Philipp

On 09.09.2010 19:24, Matt Neimeyer wrote:
> My generic question is: When I'm using a pipe line series of commands
> do I use up more/less space than doing things in sequence?
>
[...]
> OR going back to my generic question if I pipe line like "type | sort
> | uniq > output" does that only use 1x or 3x the disk space?
>
> Thanks in advance!
>
> Matt
>
> P.S. If the answer is "it depends" how do I know what it depends on?
>

It depends on whether you use MS-DOS or a better OS.
DOS was the last operating system I know of that used temporary
files for pipes. Every other system uses in-memory FIFOs
(first-in-first-out).

BTW: your last example "type | sort | uniq" can be shortened to "type |
sort -u"
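
A quick check of that equivalence on made-up sample input. Note that on Unix the counterpart of the DOS `type` command is `cat`; the sketch uses `printf` to produce the input instead. The two forms agree for plain whole-line sorting; with key or case-folding options they can differ.

```shell
#!/bin/sh
# Sample input with duplicate lines (made up for illustration).
input=$(printf 'pear\napple\npear\nfig\napple\n')

# Two processes: sort, then drop adjacent duplicates.
two_step=$(printf '%s\n' "$input" | sort | uniq)

# One process: sort and drop duplicates in the same step.
one_step=$(printf '%s\n' "$input" | sort -u)

printf '%s\n' "$one_step"
```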

Hope this helps,
Florian Philipp
 
09-09-2010, 07:19 PM
Florian Philipp

On 09.09.2010 20:25, Andrea Conti wrote:
> Note however (this is the "it depends" part) that piping does not
> affect whatever the programs might allocate or save internally: in your
> second example (which does not involve any disk writing in either case)
> "sort" needs to see the complete input before producing any output, so
> it will allocate enough memory to store it whether it is invoked alone
> or as part of a pipeline (in which case it will also stall the
> downstream pipeline section until the upstream pipe is closed).
>

When you look closer at `sort`, it is actually a quite impressive tool.
It sorts in-memory for small amounts of data and switches to temporary
files for larger. It can even compress those files to save disk space.
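
A sketch of the knobs involved, assuming GNU coreutils `sort` (the `-S`, `-T` and `--compress-program` options are GNU extensions and may be missing elsewhere). The input here is made up and far too small to actually spill to disk; the options only start to matter for inputs larger than the buffer.

```shell
#!/bin/sh
# Assumes GNU coreutils sort.
tmpdir=$(mktemp -d)

# Generate the numbers 2000..1 in descending order as sample input.
i=2000
while [ "$i" -gt 0 ]; do
    echo "$i"
    i=$((i - 1))
done > "$tmpdir/input.txt"

# -S caps the in-memory buffer, -T chooses where temporary files go,
# and --compress-program compresses those temporary files.
sort -n -S 1M -T "$tmpdir" --compress-program=gzip \
    "$tmpdir/input.txt" > "$tmpdir/sorted.txt"

head -n 3 "$tmpdir/sorted.txt"
```

Pointing `-T` at a filesystem with enough free space is the relevant part for the original question: a large sort in a pipeline can still fill a small partition through its temporary files.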

And it is still faster than most "business-grade" software for importing
data into data warehouses.

Throw `cut`, `paste`, `join` and `grep` into the mix and you can build
your own relational database system based on shell scripts.
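
As a toy illustration of the idea, with two made-up "tables" stored as text files: `join` acts as an inner join on the first field, `grep` as a WHERE clause and `cut` as the column projection.

```shell
#!/bin/sh
# Two tiny tables (invented data), keyed on the first field and
# already sorted by that key, which join requires.
workdir=$(mktemp -d)
printf '1 alice\n2 bob\n3 carol\n' > "$workdir/users.txt"
printf '1 gentoo\n2 debian\n3 gentoo\n' > "$workdir/distros.txt"

# Roughly: SELECT name FROM users JOIN distros USING (id)
#          WHERE distro = 'gentoo';
gentoo_users=$(join "$workdir/users.txt" "$workdir/distros.txt" \
    | grep ' gentoo$' \
    | cut -d ' ' -f 2)

printf '%s\n' "$gentoo_users"
```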
 
09-09-2010, 08:28 PM
Grant Edwards

On 2010-09-09, Florian Philipp <lists@f_philipp.fastmail.net> wrote:

> When you look closer at `sort`, it is actually a quite impressive
> tool. It sorts in-memory for small amounts of data and switches to
> temporary files for larger. It can even compress those files to save
> disk space.
>
> And it is still faster than most "business-grade" software for
> importing data into data warehouses.
>
> Throw `cut`, `paste`, `join` and `grep` into the mix and you can
> build your own relational database system based on shell scripts

Sort of like /rdb: http://www.rdb.com/

--
Grant Edwards               grant.b.edwards at gmail.com
Yow! Am I in Milwaukee?
 
09-09-2010, 08:46 PM
Daniel Troeder

On 09/09/2010 07:24 PM, Matt Neimeyer wrote:
> My generic question is: When I'm using a pipe line series of commands
> do I use up more/less space than doing things in sequence?
>
Everyone already answered the disk space question. I want to add just
this: It also saves you lots of I/O bandwidth: only the compressed data
gets written to disk. As I/O is the most common bottleneck, it is often
imperative to do as much as possible in a pipe. If you're lucky it
can also mean that multiple programs run at the same time, resulting in
higher throughput. You are lucky when the consumer and producer (right
and left of the pipe) can work simultaneously because the buffer is big
enough. You can see this every time you (un)pack a tar.gz.
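
The tar.gz case written out as explicit pipelines, as a sketch with made-up file names. `tar` and `gzip` run concurrently on either side of the pipe, and no uncompressed `.tar` file is written to disk.

```shell
#!/bin/sh
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/dest"
echo "hello" > "$workdir/src/a.txt"
echo "world" > "$workdir/src/b.txt"

# Pack: tar writes the archive to stdout ("-f -") and gzip compresses
# it as it arrives.
tar -C "$workdir/src" -cf - . | gzip > "$workdir/backup.tar.gz"

# Unpack: gzip decompresses to stdout and tar extracts from stdin.
gzip -dc "$workdir/backup.tar.gz" | tar -C "$workdir/dest" -xf -

ls "$workdir/dest"
```

This is what `tar -czf` and `tar -xzf` do for you behind the scenes.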

Bye,
Daniel


--
PGP key @ http://pgpkeys.pca.dfn.de/pks/lookup?search=0xBB9D4887&op=get
# gpg --recv-keys --keyserver hkp://subkeys.pgp.net 0xBB9D4887
 
09-10-2010, 03:10 PM
Paul Hartman

On Thu, Sep 9, 2010 at 3:46 PM, Daniel Troeder <daniel@admin-box.com> wrote:
> Everyone already answered the disk space question. I want to add just
> this: It also saves you lots of I/O bandwidth: only the compressed data
> gets written to disk. As I/O is the most common bottleneck, it is often
> imperative to do as much as possible in a pipe. If you're lucky it
> can also mean that multiple programs run at the same time, resulting in
> higher throughput. You are lucky when the consumer and producer (right
> and left of the pipe) can work simultaneously because the buffer is big
> enough. You can see this every time you (un)pack a tar.gz.

And if you have a huge amount of data where compression causes CPU to
become the bottleneck you can use something like pbzip2 which uses all
CPUs/cores in parallel to speed up [de]compression.
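
A sketch of dropping a parallel compressor into the same kind of pipeline. It uses `pbzip2` if it is installed and otherwise falls back to `bzip2` and then `gzip`; the fallback order and the sample data are assumptions of this sketch, not something from the thread.

```shell
#!/bin/sh
# Pick pbzip2 when available; the fallbacks are only there so the
# sketch still runs on machines without it.
if command -v pbzip2 >/dev/null 2>&1; then
    compressor=pbzip2
elif command -v bzip2 >/dev/null 2>&1; then
    compressor=bzip2
else
    compressor=gzip
fi

workdir=$(mktemp -d)
i=0
while [ "$i" -lt 5000 ]; do
    echo "row $i"
    i=$((i + 1))
done > "$workdir/data.txt"

# All three tools accept -c (write to stdout) and -d (decompress),
# so they are interchangeable in a pipeline.
"$compressor" -c < "$workdir/data.txt" > "$workdir/data.cmp"
"$compressor" -dc < "$workdir/data.cmp" > "$workdir/roundtrip.txt"

echo "used $compressor"
```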
 
09-10-2010, 03:22 PM
Matt Neimeyer

Thanks all for your help! I knew it was something simple I "should" have known.

Matt

 
09-10-2010, 04:34 PM
Florian Philipp

On 09.09.2010 22:28, Grant Edwards wrote:
> Sort of like /rdb: http://www.rdb.com/
>

Interesting. I've just read the paper they have posted.

You know what I'd really like to do? Build a graphical dataflow-centric
programming language for generating shell scripts. Since dataflows are
the real strength of shells, I figure it would be a neat tool for
improving more complex tasks. Usually I resort to temporary files when
stuff gets more complicated than a simple sequential pipe. That really
hurts performance. A more abstract representation could really help in
those situations.
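
One way to handle a simple non-linear dataflow without temporary data files is a named pipe (FIFO) combined with `tee`. This is only a sketch of that technique on made-up input, not the tool described above; the file names are hypothetical, and `seq` is assumed to be available (it is not part of POSIX).

```shell
#!/bin/sh
# Feed one data stream to two consumers without storing the stream on
# disk. A FIFO has a name in the filesystem, but its contents stay in
# memory like an ordinary pipe.
workdir=$(mktemp -d)
mkfifo "$workdir/branch"

# Consumer 1 reads the FIFO in the background and counts lines.
wc -l < "$workdir/branch" > "$workdir/count.txt" &

# tee copies the stream into the FIFO and also passes it down the main
# pipeline, where consumer 2 finds the largest value.
seq 1 100 | tee "$workdir/branch" | sort -n | tail -n 1 > "$workdir/max.txt"

wait    # let the background consumer finish
count=$(( $(cat "$workdir/count.txt") ))
max=$(cat "$workdir/max.txt")
echo "lines: $count, max: $max"
```

Only the two small result files touch the disk; the data stream itself flows through pipes in both branches.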

Well, I figure someone has already done this with Eclipse GMF or
something like that and I just don't know it. Well, whatever. Nice to
know such stuff exists, though.

Thanks for the pointer
 
