Old 02-03-2009, 04:40 PM
"Boyd Stephen Smith Jr."
 
Default Slow Script

On Tuesday 03 February 2009 11:24:37 Dave Sherohman wrote:
> Given the small piece of code that you posted and the magnitude of the
> numbers you've stated, I strongly suspect that you probably want to use
> a database for this,

Or, at the very least, a HashTable, Trie, or SearchTree.
--
Boyd Stephen Smith Jr. ,= ,-_-. =.
bss@iguanasuicide.net ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy `-'(. .)`-'
http://iguanasuicide.net/ \_/
 
Old 02-03-2009, 05:27 PM
"Stackpole, Chris"
 
Default Slow Script

> From: Dave Sherohman [mailto:dave@sherohman.org]
> Sent: Tuesday, February 03, 2009 11:25 AM
> Subject: Re: Slow Script
>
> On Tue, Feb 03, 2009 at 06:14:48PM +0100, Gorka wrote:
> > Hi! I've got a perl script with this for:
> >
> > for (my $j=0;$j<=$#fichero1;$j++)
> > {
> > if (@fichero1[$j] eq $valor1)
> > {
> > $token = 1;
> > }
> > }
> >
> > The problem is that fichero1 has 32 million records and moreover I've
> > got to repeat this several million times, so this way it would take
> > years to finish.
> > Does anybody know a way to optimize this script? Is there any other
> > linux programming language I could make this more quickly with?
> > Thank you!
>
> Although the Perl could definitely be optimized (and you've already been
> shown one way to do so), your core issue is that you're doing several
> million passes over 32 million records. That's not going to be fast in
> any language. (Even if you can check a million records per second,
> that's 32 seconds per pass, or about 9 hours for 1,000 passes, or just
> over a year for a million passes.)
[snip]

I was just thinking that as well. Does the OP have multiple boxes he can
run this on? This could easily be broken down into a parallel process,
either by manual or programmatic assignment. Splitting up the work is
pretty easy; there is even a shell script on Google Code for simple
parallel processing [1].

Of course, there are a fair number of ifs in this. (If there are
resources, if the data can be split/shared easily, etc.)

If not, Dave's suggestion of a database is a good one too.


~Stack~

[1] http://code.google.com/p/ppss/
Note: you will probably need to do a fair bit of tweaking for this, but
the ideas are what will be most useful to you anyway.
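
For what it's worth, a rough sketch of that splitting idea in plain Perl
(not PPSS): it assumes the records sit one per line in chunk files named
chunk_* that were produced beforehand (e.g. with split(1)), and that all
we need to know is whether a made-up value occurs anywhere.

#!/usr/bin/perl
use strict;
use warnings;

# Rough sketch: scan pre-split chunk files in parallel child processes
# and report back via exit status.
my $valor1 = 'needle';              # hypothetical value to search for
my @chunks = glob 'chunk_*';

for my $chunk (@chunks) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    next if $pid;                   # parent: keep launching children

    # child: scan its own chunk, one record per line
    open my $fh, '<', $chunk or exit 2;
    while (my $line = <$fh>) {
        chomp $line;
        exit 0 if $line eq $valor1; # found
    }
    exit 1;                         # not found in this chunk
}

# parent: the value was found if any child exited with status 0
my $found = 0;
while (wait() != -1) {
    $found = 1 if $? == 0;
}
print $found ? "found\n" : "not found\n";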


 
Old 02-03-2009, 06:54 PM
Joel Roth
 
Default Slow Script

On Tue, Feb 03, 2009 at 06:14:48PM +0100, Gorka wrote:
> Hi! I've got a perl script with this for:
>
> for (my $j=0;$j<=$#fichero1;$j++)
> {
> if (@fichero1[$j] eq $valor1)
>
if ($fichero1[$j] eq $valor1)
^^^

This is a beginner's mistake: a single array element is accessed with
the $ sigil, not @. You should use warnings, i.e. run perl with the -w
flag, or put 'use warnings;' at the top of your script. That will catch
this mistake, and probably many others that you will make, too.


> {
> $token = 1;
> }
> }

It seems like you will be better off improving your algorithm
rather than trying to get this code to run faster.

For the several-million-tests part, you might use a hash (dictionary)
approach for the lookup rather than separate tests, as already
suggested. That way you can handle your test-and-act with a single hash
lookup instead of millions of comparisons.
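
A minimal sketch of that approach, with a few placeholder records
standing in for the OP's 32-million-element @fichero1 and made-up values
to test:

#!/usr/bin/perl
use strict;
use warnings;

# Placeholder data standing in for the 32-million-record array and the
# several million values that need to be tested.
my @fichero1 = ('aaa', 'bbb', 'ccc');
my @valores  = ('bbb', 'zzz');

# Build the lookup hash once: a single pass over @fichero1.
my %seen;
$seen{$_} = 1 for @fichero1;

# Each test is now one hash lookup instead of a scan of the whole array.
for my $valor1 (@valores) {
    my $token = exists $seen{$valor1} ? 1 : 0;
    print "$valor1 -> $token\n";
}

Building %seen costs one pass and some memory, but every test after that
is effectively constant-time.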

If you have this large a problem, it may be worth posting
a fuller description of the larger problem you want to
solve at perlmonks.org or another programming-oriented forum.

> The problem is that fichero1 has 32 million records and moreover I've
> got to repeat this several million times, so this way it would take
> years to finish.
> Does anybody know a way to optimize this script? Is there any other linux
> programming language I could make this more quickly with?
> Thank you!

--
Joel Roth


 
Old 02-04-2009, 01:02 AM
Chris Jones
 
Default Slow Script

On Tue, Feb 03, 2009 at 12:14:48PM EST, Gorka wrote:
> Hi! I've got a perl script with this for:
>
> for (my $j=0;$j<=$#fichero1;$j++)
> {
> if (@fichero1[$j] eq $valor1)
> {
> $token = 1;
> }
> }

> The problem is that fichero1 has 32 million records and moreover
> I've got to repeat this several million times, so this way it
> would take years to finish. Does anybody know a way to optimize this
> script? Is there any other linux programming language I could make this
> more quickly with?

Since I can't imagine you need this on your home machine, I would talk
to my boss ... recommend an IBM mainframe running z/OS and a consultant
who will charge you $5000.00 to write three lines of JCL and optionally
ten lines of assembler that will emulate the above logic.

Contact me off-list if interested.

More seriously, when you are dealing with 32 million records, one major
avenue for optimization is to keep disk access to a minimum. Disk access,
IIRC, is measured in milliseconds, RAM access in nanoseconds and above..

Do the math..

The way to look at it is to make sure any logical record is transferred
from disk to RAM _once only_ (rather than a million times), and that each
disk access transfers as many records to central memory as the
filesystem (or rather the "access method" in mainframe parlance) and
hardware architecture allow. For instance, if you set yourself up so
that your file's physical layout is such that each block contains 5,000
records, and your access method (driver?) lets you request 256 blocks in
one disk access, the same program will run orders of magnitude faster
than if each block contained one record and you were reading one block
at a time. Even though the data transfer times would be comparable, you
would be skipping all the individual wait times (head positioning, and
waiting while some unrelated process keeps the controller busy).

If you have to stick with an "Intel Inside" machine running linux, even
though neither the machine nor the OS were designed with this type of
work in mind, there are probably ways to keep disk access to a healthy
minimum but that's something I can't help you with.
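
One such way, offered only as a hedged Perl sketch: read the file in
large fixed-size blocks so each disk access brings in many records at
once. The file name, block size and search value below are made up for
illustration.

#!/usr/bin/perl
use strict;
use warnings;

# Sketch of chunked reading: pull the file in large blocks, then split
# each block into records. Assumes one record per line.
my $valor1    = 'needle';
my $blocksize = 8 * 1024 * 1024;        # 8 MB per read
my $found     = 0;
my $leftover  = '';

open my $fh, '<', 'fichero1.txt' or die "open: $!";
my $block;
while (sysread($fh, $block, $blocksize)) {
    $block = $leftover . $block;
    # keep any partial record at the end of the block for the next round
    my $cut = rindex($block, "\n");
    if ($cut >= 0) {
        $leftover = substr($block, $cut + 1);
        $block    = substr($block, 0, $cut);
    } else {
        $leftover = $block;
        $block    = '';
    }
    for my $record (split /\n/, $block) {
        if ($record eq $valor1) {
            $found = 1;
            last;
        }
    }
    last if $found;
}
close $fh;

# a final record with no trailing newline ends up in $leftover
$found = 1 if !$found && $leftover eq $valor1;
print $found ? "found\n" : "not found\n";

In practice Perl's normal buffered line reads already fetch the file in
blocks behind the scenes, so an explicit scheme like this mostly matters
when you want to control the block size yourself.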

Obviously, as others have suggested, this doesn't mean that you should
not _first_ look into why your logic dictates that you need to process
32 million records "several million times".

HTH

CJ


 
Old 02-04-2009, 05:57 AM
Alex Samad
 
Default Slow Script

On Tue, Feb 03, 2009 at 09:02:52PM -0500, Chris Jones wrote:
> On Tue, Feb 03, 2009 at 12:14:48PM EST, Gorka wrote:
> > Hi! I've got a perl script with this for:
> >
> > for (my $j=0;$j<=$#fichero1;$j++)
> > {
> > if (@fichero1[$j] eq $valor1)
> > {
> > $token = 1;
> > }
> > }
>
> > The problem is that fichero1 has 32 million records and moreover
> > I've got to repeat this several million times, so this way it
> > would take years to finish. Does anybody know a way to optimize this
> > script? Is there any other linux programming language I could make this
> > more quickly with?

[snip]

> More seriously, when you are dealing with 32 million records, one major
> avenue for optimization is to keep disk access to a minimum. Disk access,
> IIRC, is measured in milliseconds, RAM access in nanoseconds and above..
>
> Do the math..
[silly time]
32 million * 4 bytes = 128 MB

so with 128M of memory he could hold 32 million long ints - I realise
each record has probably got more than ints, so with 1G of spare RAM he
could have 32 bytes per record.

[/silly time]



[snip]

>
> HTH
>
> CJ
>
>

--
"It's a school full of so-called at-risk children. It's how we, unfortunately, label certain children. It means basically they can't learn. It's one of the best schools in Houston."

- George W. Bush
While speaking about KIPP Academy in Houston, TX
 
Old 02-04-2009, 10:17 AM
Dave Sherohman
 
Default Slow Script

On Tue, Feb 03, 2009 at 09:02:52PM -0500, Chris Jones wrote:
> More seriously, when you are dealing with 32 million records, one major
> avenue for optimization is to keep disk access to a minimum. Disk access,
> IIRC, is measured in milliseconds, RAM access in nanoseconds and above..
>
> Do the math..

Given that the posted loop is operating entirely on Perl in-memory
arrays, the OP is unlikely to be deliberately[1] accessing the disk
during this process.


[1] If it's a tied array, then it could have some magical disk
interaction behind it, but the OP doesn't appear to have reached a state
of Perl Enlightenment which would allow him to create or optimize magic
that deep. The other possibility for disk access would be if the
dataset is larger than available RAM and it's getting paged in and out
from disk, which is just bad news for performance no matter how you
slice it. Aside from those two cases, it looks very unlikely that I/O
would be the bottleneck here.
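
For anyone curious what the tied-array case looks like, a minimal sketch
using the core Tie::File module, where ordinary-looking array reads are
actually backed by a file on disk (the file name and search value are
made up):

#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

# Tie an array to a file: each element is one line of 'fichero1.txt',
# and element accesses translate into disk reads behind the scenes.
tie my @fichero1, 'Tie::File', 'fichero1.txt'
    or die "cannot tie file: $!";

# Looks like an in-memory scan, but each $fichero1[$j] may touch the disk.
my $valor1 = 'needle';
my $token  = 0;
for my $j (0 .. $#fichero1) {
    if ($fichero1[$j] eq $valor1) {
        $token = 1;
        last;
    }
}
print "token=$token\n";

untie @fichero1;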

--
Dave Sherohman
NomadNet, Inc.
http://nomadnetinc.com/


 
Old 02-04-2009, 11:14 AM
Steve Lamb
 
Default Slow Script

Dave Sherohman wrote:
> Given that the posted loop is operating entirely on Perl in-memory
> arrays, the OP is unlikely to be deliberately[1] accessing the disk
> during this process.

TBH, given the fragment he posted, there's no way to help him. There isn't
enough there to make any meaningful suggestions. Ya can't exactly squeeze
more speed out of Perl's for loop because, well, it only goes so fast, and
without knowing what he's working with there's no sane way to suggest an
alternative which might be faster.

--
Steve C. Lamb | But who can decide what they dream
PGP Key: 1FC01004 | and dream I do
-------------------------------+---------------------------------------------
 
Old 02-05-2009, 12:44 AM
Chris Jones
 
Default Slow Script

On Wed, Feb 04, 2009 at 06:17:43AM EST, Dave Sherohman wrote:
> On Tue, Feb 03, 2009 at 09:02:52PM -0500, Chris Jones wrote:
> > More seriously, when you are dealing with 32 million records, one major
> > avenue for optimization is to keep disk access to a minimum. Disk access,
> > IIRC, is measured in milliseconds, RAM access in nanoseconds and above..
> >
> > Do the math..
>
> Given that the posted loop is operating entirely on Perl in-memory
> arrays, the OP is unlikely to be deliberately[1] accessing the disk
> during this process.
>
>
> [1] If it's a tied array, then it could have some magical disk
> interaction behind it, but the OP doesn't appear to have reached a state
> of Perl Enlightenment which would allow him to create or optimize magic
> that deep. The other possibility for disk access would be if the
> dataset is larger than available RAM..

Ay, there's the rub.

> ..and it's getting paged in and out from disk, which is just bad news
> for performance no matter how you slice it.

The worst possible scenario (as far as I understand it :-) because
you'll still have the I/O on the file _plus_ the I/O on your swap
partition/datasets _plus_ high system CPU usage due to the paging.

And the worst thing about this is that it is unpredictable.. there will
be times when it won't happen because enough memory is available.. and
other times when you get paged at four in the morning.

> Aside from those two cases, it looks very unlikely that I/O would be
> the bottleneck here.

Trust me. Whatever the machine or OS, when you're dealing with such
volumes, I/O always ends up being part of the equation.

I don't know Perl. Thanks for the "tied array" hint.

CJ


 
Old 02-05-2009, 12:45 AM
Chris Jones
 
Default Slow Script

On Wed, Feb 04, 2009 at 01:57:04AM EST, Alex Samad wrote:

> [silly time]

> 32 million * 4 bytes = 128 MB
>
> so with 128M of memory he could hold 32 million long ints - I realise
> each record has probably got more than ints, so with 1G of spare RAM he
> could have 32 bytes per record.

Hmm.. 32-byte records.. maybe a bare-bones ldap directory?

Naturally, with 300-byte records, you need 10G of spare RAM .. etc.

Another thing.. how do you guarantee your 1GB.. (10GB.. 100GB.. etc.) of
spare RAM is available when your script starts running?

Because if not, wouldn't that send your system paging/swapping like
crazy, with the added benefit of trashing the system with extra CPU
overhead the minute your script starts running?

Enlighten me!

> [/silly time]
>
> [snip]

> -- "It's a school full of so-called at-risk children. It's how we,
> unfortunately, label certain children. It means basically they can't
> learn. It's one of the best schools in Houston."
>
> - George W. Bush While speaking about KIPP Academy in Houston, TX

In the future, Samad, please refrain from posting stuff like the above
concerning my country's former President.

You and your pals may think it is funny but some of us around here may
find it quite offensive.

Thank you.

CJ


 
Old 02-05-2009, 01:04 AM
Alex Samad
 
Default Slow Script

On Wed, Feb 04, 2009 at 08:45:35PM -0500, Chris Jones wrote:
> On Wed, Feb 04, 2009 at 01:57:04AM EST, Alex Samad wrote:
>
> > [silly time]
>
> > 32 million * 4 bytes = 128 MB
> >
> > so with 128M of memory he could hold 32 million long ints - I realise
> > each record has probably got more than ints, so with 1G of spare RAM he
> > could have 32 bytes per record.
>
> Hmm.. 32-byte records.. maybe a bare-bones ldap directory?

Well, he is only comparing ints by the looks of it (we are just guessing
until we get more info), so 4-byte long int records ...

>
> Naturally, with 300-byte records, you need 10G of spare RAM .. etc.
>
> Another thing.. how do you guarantee your 1GB.. (10GB.. 100GB.. etc.) of
> spare RAM is available when your script starts running?
>
> Because if not, wouldn't that send your system paging/swapping like
> crazy, with the added benefit of trashing the system with extra CPU
> overhead the minute your script starts running?
>
> Enlighten me!

Well, I have 8G in my home server - Debian amd64 - and I don't think
having 1-2G free is unreasonable. It all depends.

>
> > [/silly time]
> >
> > [snip]
>
> > -- "It's a school full of so-called at-risk children. It's how we,
> > unfortunately, label certain children. It means basically they can't
> > learn. It's one of the best schools in Houston."
> >
> > - George W. Bush While speaking about KIPP Academy in Houston, TX
>
> In the future, Samad, please refrain from posting stuff like the above
> concerning my country's former President.

Hey, it's just fortune quoting Bush; it's real life.

>
> You and your pals may think it is funny but some of us around here may
> find it quite offensive.

Blame Bush, he said it, not me! I get quite a kick out of them.

>
> Thank you.
>
> CJ
>
>

--
Trust everybody, but cut the cards.
-- Finlay Peter Dunne, "Mr. Dooley's Philosophy"
 
