Old 12-05-2011, 06:10 AM
Alec Warner
 
sources.gentoo.org instability

Hello,

For a while sources.gentoo.org has been puttering along and its health
has slowly declined. We migrated it to newer, shinier hardware in an
attempt to mitigate the problem, but that did not pan out. 90% (or
more) of sources.gentoo.org traffic is crawler bots, not actual
humans. That being said, if we cannot serve requests to the bots
within our timeouts we serve 500s instead, which is never really what
we want (particularly when we have spent 20s of CPU calculating 80% of
the response only to see the client time out :/).

The majority of the expensive requests are package.mask and
use.local.desc queries by crawlers, such as crawling the entire
13000-revision history of package.mask (or similar).

It is likely we will monkey patch viewvc to be less wasteful. In
the meantime I have removed use.local.desc from sources.gentoo.org
(and also from anoncvs, because they share the same repo). I hope this
is a short-term (order of weeks) hack.

-A
 
Old 12-05-2011, 10:48 AM
"Andreas K. Huettel"
 

Seriously, what do we gain from crawlers accessing sources.gentoo.org? I can't
really remember seeing it once in a Google query result...

Possibly it would not even be required to deny all requests, but just deny
everything related to ancient history...


--
Andreas K. Huettel
Gentoo Linux developer
kde, sci, arm, tex, printing
 
Old 12-05-2011, 03:27 PM
Alec Warner
 

On Mon, Dec 5, 2011 at 3:48 AM, Andreas K. Huettel <dilfridge@gentoo.org> wrote:
>
> Seriously, what do we gain from crawlers accessing sources.gentoo.org? I can't
> really remember seeing it once in a Google query result...

We want the site searchable.

>
> Possibly it would not even be required to deny all requests, but just deny
> everything related to ancient history...
 
Old 12-09-2011, 04:30 AM
Alec Warner
 

2011/12/5 Chí-Thanh Christopher Nguyễn <chithanh@gentoo.org>:
> Alec Warner wrote:
>>> Seriously, what do we gain from crawlers accessing sources.gentoo.org? I can't
>>> really remember seeing it once in a Google query result...
>>
>> We want the site searchable.
>
>>>> The majority of the expensive requests are related to package.mask and
>>>> use.local.desc queries by crawlers. Like crawling the entire 13000 rev
>>>> history for package.mask (or similar.)
>
> Would it be feasible to use mod_rewrite to direct the most expensive
> requests to a static copy, which is re-generated every
> ${REASONABLE_TIMEFRAME}?

For now, user-agents that look like a bot get sent to
sources2.gentoo.org (via an HTTP 302, not a permanent redirect) and
humans stay on sources.gentoo.org. Assuming the crawlers and indexing
systems follow the spec, hopefully our search results do not all get
rewritten to sources2.gentoo.org (that would surprise me greatly...
wait, no, it wouldn't ;p).
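
As a rough illustration of the routing described above (not the actual
configuration, which this thread does not show), the decision could be
sketched like this. The substring pattern and the names looks_like_bot and
route_request are assumptions made up for the example, and a missing
User-Agent is treated as a human here, which is also just a guess.

```python
import re

# Hypothetical pattern: these substrings are assumptions for illustration,
# not the list actually used on sources.gentoo.org.
BOT_PATTERN = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)

# Secondary host named in the thread.
BOT_HOST = "sources2.gentoo.org"


def looks_like_bot(user_agent):
    """Return True if the User-Agent header resembles a crawler."""
    return bool(user_agent) and bool(BOT_PATTERN.search(user_agent))


def route_request(user_agent, path):
    """Return (status, location) for a request.

    Bots get a temporary 302 redirect to the secondary host; everyone
    else gets (200, None), meaning "serve it locally".
    """
    if looks_like_bot(user_agent):
        return 302, "http://%s%s" % (BOT_HOST, path)
    return 200, None
```

A 302 is used rather than a 301 so that a well-behaved indexer keeps the
original sources.gentoo.org URL in its results instead of replacing it
with the redirect target.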

Robin added a caching layer for some segments of the application; I am
looking at cProfile dumps and discussing pain points with upstream.
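
For anyone who wants to poke at this kind of thing themselves, here is a
minimal sketch of collecting a cProfile report for one expensive call.
render_history is a hypothetical stand-in for a costly log view and is not
viewvc code; profile_call is likewise a name invented for the example.

```python
import cProfile
import io
import pstats


def render_history(revisions):
    """Hypothetical stand-in for an expensive log view (not viewvc code)."""
    return "\n".join("r%d" % rev for rev in range(revisions))


def profile_call(func, *args):
    """Run func under cProfile.

    Returns (result, report), where report is the text of the ten entries
    with the highest cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_call(render_history, 13000)
    print(report)
```

Sorting by cumulative time tends to surface the top-level call that is
eating the CPU budget, which is usually the first thing worth bringing to
upstream.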

-A

>
>
> Best regards,
> Chí-Thanh Christopher Nguyễn
>
 
