01-17-2008, 01:30 PM
Ant Cunningham

Site Spider

Does anyone know of any good HTTP spider/ripper utilities that will preserve
file structure and names?

thanx!



01-17-2008, 03:14 PM
Paul Lemmons

Site Spider

-------- Original Message --------
Subject: Site Spider
From: Ant Cunningham <prodigitalson@vectrbas-d.com>
To: kubuntu-users <kubuntu-users@lists.ubuntu.com>
Date: 01/17/2008 07:30 AM

> Does anyone know of any good HTTP spider/ripper utilities that will preserve
> file structure and names?
>
> thanx!
>
>
>

wget

--
Sometimes I wonder. Were our faith able to stand upright and look
around, would it be looking down at the mustard seed or standing in awe
of the height and breadth of it.

01-17-2008, 03:24 PM
Ant Cunningham

Site Spider

On 1/17/08 11:14 AM, "Paul Lemmons" <paul@lemmons.name> wrote:

> -------- Original Message --------
> Subject: Site Spider
> From: Ant Cunningham <prodigitalson@vectrbas-d.com>
> To: kubuntu-users <kubuntu-users@lists.ubuntu.com>
> Date: 01/17/2008 07:30 AM
>
>> Does anyone know of any good HTTP spider/ripper utilities that will preserve
>> file structure and names?
>>
>> thanx!
>>
>>
>>
>
> wget

Granted, I haven't read the man page for all the options, but I assumed wget
worked similarly to curl in that you had to supply each URL. I don't know the
URL of every page and I don't feel like writing a spider myself - that's why
I asked.

Will wget spider through the site for me and grab everything - or, better
yet, everything I don't exclude with some option?



01-17-2008, 03:56 PM
Paul Lemmons

Site Spider

-------- Original Message --------
Subject: Re: Site Spider
From: Ant Cunningham <prodigitalson@vectrbas-d.com>
To: kubuntu-users <kubuntu-users@lists.ubuntu.com>
Date: 01/17/2008 09:24 AM
> On 1/17/08 11:14 AM, "Paul Lemmons" <paul@lemmons.name> wrote:
>
>
>> -------- Original Message --------
>> Subject: Site Spider
>> From: Ant Cunningham <prodigitalson@vectrbas-d.com>
>> To: kubuntu-users <kubuntu-users@lists.ubuntu.com>
>> Date: 01/17/2008 07:30 AM
>>
>>
>>> Does anyone know of any good HTTP spider/ripper utilities that will preserve
>>> file structure and names?
>>>
>>> thanx!
>>>
>>>
>>>
>>>
>> wget
>>
>
> Granted, I haven't read the man page for all the options, but I assumed wget
> worked similarly to curl in that you had to supply each URL. I don't know the
> URL of every page and I don't feel like writing a spider myself - that's why
> I asked.
>
> Will wget spider through the site for me and grab everything - or, better
> yet, everything I don't exclude with some option?
>
>
>
>
wget -rc http://www.your-site.com
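(Here -r tells wget to follow links recursively and -c resumes any
partially-downloaded files; both are standard wget options.)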

--
Sometimes I wonder. Were our faith able to stand upright and look around, would it be looking down at the mustard seed or standing in awe of the height and breadth of it.


01-17-2008, 04:00 PM
Paul Lemmons

Site Spider

-------- Original Message --------
Subject: Re: Site Spider
From: Ant Cunningham <prodigitalson@vectrbas-d.com>
To: kubuntu-users <kubuntu-users@lists.ubuntu.com>
Date: 01/17/2008 09:24 AM
> On 1/17/08 11:14 AM, "Paul Lemmons" <paul@lemmons.name> wrote:
>
>
>> -------- Original Message --------
>> Subject: Site Spider
>> From: Ant Cunningham <prodigitalson@vectrbas-d.com>
>> To: kubuntu-users <kubuntu-users@lists.ubuntu.com>
>> Date: 01/17/2008 07:30 AM
>>
>>
>>> Does anyone know of any good HTTP spider/ripper utilities that will preserve
>>> file structure and names?
>>>
>>> thanx!
>>>
>>>
>>>
>>>
>> wget
>>
>
> Granted, I haven't read the man page for all the options, but I assumed wget
> worked similarly to curl in that you had to supply each URL. I don't know the
> URL of every page and I don't feel like writing a spider myself - that's why
> I asked.
>
> Will wget spider through the site for me and grab everything - or, better
> yet, everything I don't exclude with some option?
>
>
>
>
Oops... missed the "exclude" requirement...

wget -rc -X/not/this/dir,/and/not/this/one http://www.your-site.com
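
The same exclusion written with wget's long options, which may be easier to
read (a sketch, assuming a stock wget; the directories and URL are placeholders):

wget --recursive --continue --exclude-directories=/not/this/dir,/and/not/this/one http://www.your-site.com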


--
Sometimes I wonder. Were our faith able to stand upright and look around, would it be looking down at the mustard seed or standing in awe of the height and breadth of it.


01-19-2008, 07:53 PM
Wulfy

Site Spider

Paul Lemmons wrote:
> wget -rc http://www.your-site.com

If you started it at a folder below / on the site, would it just get
everything under that folder? Or would it climb the tree as well as descend?

--
Blessings

Wulfmann

Wulf Credo:
Respect the elders. Teach the young. Co-operate with the pack.
Play when you can. Hunt when you must. Rest in between.
Share your affections. Voice your opinion. Leave your Mark.
Copyright July 17, 1988 by Del Goetz


01-19-2008, 09:16 PM
Paul Lemmons

Site Spider

Wulfy wrote:
> Paul Lemmons wrote:
>> wget -rc http://www.your-site.com
>
> If you started it at a folder below / on the site, would it just get
> everything under that folder? Or would it climb the tree as well as descend?

For it to "spider" through, wget opens the initial page (usually
index.html, index.php or default.htm), follows the links to new pages,
then follows their links, and so on until it has the whole site.

If you want to create a complete backup, including files that are not
linked to, you will want to use the ftp protocol instead of http:

wget -rc ftp://userid:password@www.your-site.com

wget --help gives you some help remembering the options. "man wget" gives
you a lot more detail. Googling will turn up lots of examples.
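
For an http-only mirror that stays below the starting folder and keeps the
pages usable offline, something along these lines should also work (just a
sketch built from standard wget options; the URL and path are placeholders):

wget --mirror --no-parent --page-requisites --convert-links http://www.your-site.com/some/folder/

--no-parent keeps wget from climbing above the starting folder,
--page-requisites pulls in images and stylesheets, and --convert-links
rewrites the links so the local copy can be browsed offline.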




01-20-2008, 12:30 AM
Wulfy

Site Spider

Paul Lemmons wrote:
> For it to "spider" through, wget opens the initial page (usually
> index.html, index.php or default.htm), follows the links to new pages,
> then follows their links, and so on until it has the whole site.
>
> If you want to create a complete backup, including files that are not
> linked to, you will want to use the ftp protocol instead of http:
>
> wget -rc ftp://userid:password@www.your-site.com
>
> wget --help gives you some help remembering the options. "man wget" gives
> you a lot more detail. Googling will turn up lots of examples.
>
I decided to try it anyway before I got your answer.

wget -rc with the http protocol downloads the index.html and the
robots.txt files... it doesn't go on from there.

Since you suggested using the ftp protocol, I tried again. It doesn't
even find the site... it can't change to the directory.

Let me explain what I want to do. There is an archived copy at
web.archive.org of a website that is no longer online. I want to retrieve
the data from that site. I can go through it, page by page, and save as
I go, but that seemed a bit long-winded when wget could grab the lot in
one command. (I also tried it on a live site and got the same results.)
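
If the recursion is stopping because wget fetches and then honours the
site's robots.txt, it can be told to ignore that file. This is only a guess
at what is happening here, and the URL below is a placeholder:

wget -rc -e robots=off --wait=1 http://www.example.com/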

--
Blessings

Wulfmann

Wulf Credo:
Respect the elders. Teach the young. Co-operate with the pack.
Play when you can. Hunt when you must. Rest in between.
Share your affections. Voice your opinion. Leave your Mark.
Copyright July 17, 1988 by Del Goetz


01-20-2008, 05:30 AM
Donn

Site Spider

There is a gui that does this. It has a name so abysmal that I can't recall
it...

I used this script once a few years ago to fetch a website.
It takes two parameters: url and level.
The level is how far down a chain of links it should go.
You could just replace the vars and run the command directly.
===

#!/bin/bash
# Try to make using wget easier than it bloody is.
url=$1
# Bail out if no url was given.
if [ -z "$url" ]; then echo "Bad url"; exit 1; fi
# Recursion depth, defaults to 2 if not supplied.
LEV=$2
if [ -z "$LEV" ]; then
    LEV="2"
fi

echo "running: wget --convert-links -r -l$LEV $url -o log"
# -r recurses, -l sets the depth, --convert-links rewrites links for
# local browsing, -o sends wget's messages to a file named "log".
wget --convert-links -r -l"$LEV" "$url" -o log

===
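
A hypothetical invocation, assuming the script above is saved as
fetchsite.sh (the name is made up) and made executable:

chmod +x fetchsite.sh
./fetchsite.sh http://www.example.com 3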

man wget is the best plan really.


d

--
Gee, what a terrific party. Later on we'll get some fluid and embalm each
other. -- Neil Simon

Fonty Python and other dev news at:
http://otherwiseingle.blogspot.com/

01-20-2008, 07:15 AM
Wulfy

Site Spider

[I sent this to Donn's private e-mail by mistake.. sorry Donn.]

Donn wrote:
> There is a gui that does this. It has a name so abysmal that I can't recall
> it...
>
> I used this script once a few years ago to fetch a website.
> It takes two parameters: url and level.
> The level is how far down a chain of links it should go.
> You could just replace the vars and run the command directly.
> ===
>
> #!/bin/bash
> # Try to make using wget easier than it bloody is.
> url=$1
> if [ -z "$url" ]; then echo "Bad url"; exit 1; fi
> LEV=$2
> if [ -z "$LEV" ]; then
>     LEV="2"
> fi
>
> echo "running: wget --convert-links -r -l$LEV $url -o log"
> wget --convert-links -r -l"$LEV" "$url" -o log
>
> ===
>
> man wget is the best plan really.
>
>
> d
>
>
<sigh> I don't know what I'm doing wrong, but I can't get wget to get
more than the top layer of the site. The archive.org site just brings
in index.html (and robots.txt). I tried it on another site and it
brought in the two versions of the main page (dialup and high speed) but
the menu links weren't followed. I tried -l5 and -15 and got the same
download.

Any idea why the -r isn't recursing?
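
One way to see why links are not being followed is to re-run wget with its
own diagnostics turned on and read what it reports for each URL (a sketch,
with a placeholder URL):

wget --debug -r -l5 http://www.example.com 2>&1 | less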

--
Blessings

Wulfmann

Wulf Credo:
Respect the elders. Teach the young. Co-operate with the pack.
Play when you can. Hunt when you must. Rest in between.
Share your affections. Voice your opinion. Leave your Mark.
Copyright July 17, 1988 by Del Goetz


