08-10-2011, 04:47 PM
Hal Burgiss

Scripting / one liner help

On Wed, Aug 10, 2011 at 12:29 PM, Patton Echols <p.echols@comcast.net> wrote:

> I am looking for thoughts on how I might extract image names from an html document.
>
> The document started as a Word document with nothing but images, one per page, randomly named. It was saved as html using LibreOffice, so I now have the images separate. I have a script that will process them through ImageMagick to clean them up, reduce them from full color to b/w and make them into a pdf. But the pages are out of order because the images are randomly named.
>
> What I'd like to do is have something read the html file in order and either feed the names of the JPGs to the script in order or just spit them out to a file that I can feed to the script. The html source has all the images listed sequentially without line breaks. Each tag is the same except for the image name and looks like this:
>
> <IMG SRC="source_html_m1463afff.jpg" NAME="graphics3" ALIGN=BOTTOM WIDTH=575 HEIGHT=790 BORDER=0>

See if this gets close to extracting the image names ...

grep SRC *html | sed -r 's/SRC="([^"]+)"/\1/ig' | whatever_script.sh


--
Hal

--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 
08-10-2011, 06:42 PM
Patton Echols

Scripting / one liner help

On 08/10/2011 09:47 AM, Hal Burgiss wrote:
> On Wed, Aug 10, 2011 at 12:29 PM, Patton Echols <p.echols@comcast.net>
> wrote:
>
>> I am looking for thoughts on how I might extract image names from
>> an html document.
>>
>> The document started as a Word document with nothing but images,
>> one per page, randomly named. It was saved as html using
>> LibreOffice, so I now have the images separate. I have a script that
>> will process them through ImageMagick to clean them up, reduce
>> them from full color to b/w and make them into a pdf. But the pages
>> are out of order because the images are randomly named.
>>
>> What I'd like to do is have something read the html file in order
>> and either feed the names of the JPGs to the script in order or
>> just spit them out to a file that I can feed to the script. The
>> html source has all the images listed sequentially without line
>> breaks. Each tag is the same except for the image name and looks
>> like this:
>> <IMG SRC="source_html_m1463afff.jpg" NAME="graphics3" ALIGN=BOTTOM
>> WIDTH=575 HEIGHT=790 BORDER=0>
>
> See if this gets close to extracting the image names ...
>
> grep SRC *html | sed -r 's/SRC="([^"]+)"/\1/ig' | whatever_script.sh




Thanks Hal,

my script starts with "for i in *jpg" and then works each file
individually. So I tried that line without the pipe to
whatever_script.sh, hoping for a list of files to be output to the
terminal. That seemed to output the string of tags but without the
double quotes around the image names. Is that what it should have done?


Thanks
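
For the "for i in *jpg" loop mentioned above, one possible way to process images in document order rather than glob order is to read the names from a list file. This is only a minimal sketch: the list file and the two image names are hypothetical, and the loop body just records the order where the real per-image processing would go.

```shell
# Hypothetical list file: one image name per line, already in page order.
list=$(mktemp)
printf '%s\n' 'zebra.jpg' 'apple.jpg' > "$list"

# Read the names in file order instead of globbing with "for i in *jpg".
ordered=""
while IFS= read -r i; do
    # The real per-image processing would go here; this only records the order.
    ordered="$ordered$i "
done < "$list"

printf '%s\n' "$ordered"
rm -f "$list"
```

A glob would sort these two names alphabetically (apple.jpg first); reading the list keeps whatever order the list was written in.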

08-10-2011, 07:00 PM
Johnny Rosenberg

Scripting / one liner help

2011/8/10 Hal Burgiss <hal@burgiss.net>:
> On Wed, Aug 10, 2011 at 12:29 PM, Patton Echols <p.echols@comcast.net>
> wrote:
>>
>> I am looking for thoughts on how I might extract image names from an html
>> document.
>>
>> The document started as a Word document with nothing but images, one per
>> page, randomly named. It was saved as html using LibreOffice, so I now
>> have the images separate. I have a script that will process them through
>> ImageMagick to clean them up, reduce them from full color to b/w and make
>> them into a pdf. But the pages are out of order because the images are
>> randomly named.
>>
>> What I'd like to do is have something read the html file in order and
>> either feed the names of the JPGs to the script in order or just spit them
>> out to a file that I can feed to the script. The html source has all the
>> images listed sequentially without line breaks. Each tag is the same except
>> for the image name and looks like this:
>> <IMG SRC="source_html_m1463afff.jpg" NAME="graphics3" ALIGN=BOTTOM
>> WIDTH=575 HEIGHT=790 BORDER=0>
>>
>
> See if this gets close to extracting the image names ...
> grep SRC *html | sed -r 's/SRC="([^"]+)"/\1/ig' | whatever_script.sh

I didn't create this thread, but can you please explain that sed
statement? I don't get it… (I'm not a beginner with regular
expressions but I'm definitely not an expert either…)


Kind regards

Johnny Rosenberg
ジョニー・ローゼンバーグ

08-10-2011, 09:46 PM
Hal Burgiss

Scripting / one liner help

On Wed, Aug 10, 2011 at 2:42 PM, Patton Echols <p.echols@comcast.net> wrote:

> On 08/10/2011 09:47 AM, Hal Burgiss wrote:
>
>> See if this gets close to extracting the image names ...
>>
>> grep SRC *html | sed -r 's/SRC="([^"]+)"/\1/ig' | whatever_script.sh
>
> Thanks Hal,
>
> My script starts with "for i in *jpg" and then works each file individually. So I tried that line without the pipe to whatever_script.sh, hoping for a list of files to be output to the terminal. That seemed to output the string of tags but without the double quotes around the image names. Is that what it should have done?

That was what I was getting at, yes, the list of image filenames.
If you want just file names, you might try:

grep -H SRC *html | sed 's/:.*//' | sort | uniq

or something along those lines. (Completely untested)
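
To see what that last pipeline prints, here is a small sketch run against two throwaway files (the directory, file names and contents are made up for the test). As far as I can tell it yields the names of the HTML files that contain SRC, not the image names, which is the same result `grep -l` gives.

```shell
# Scratch directory with two hypothetical HTML files.
dir=$(mktemp -d)
printf '%s\n' '<IMG SRC="a.jpg">' '<IMG SRC="b.jpg">' > "$dir/one.html"
printf '%s\n' '<p>no images here</p>' > "$dir/two.html"

# The pipeline from the post: keep the "file:" prefix that -H adds, drop the rest.
via_sed=$(cd "$dir" && grep -H SRC *html | sed 's/:.*//' | sort | uniq)

# grep -l prints each matching file name once.
via_l=$(cd "$dir" && grep -l SRC *html)

printf '%s\n%s\n' "$via_sed" "$via_l"
rm -rf "$dir"
```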
--
Hal

08-10-2011, 09:52 PM
Hal Burgiss

Scripting / one liner help

On Wed, Aug 10, 2011 at 3:00 PM, Johnny Rosenberg <gurus.knugum@gmail.com> wrote:

> 2011/8/10 Hal Burgiss <hal@burgiss.net>:
>> See if this gets close to extracting the image names ...
>> grep SRC *html | sed -r 's/SRC="([^"]+)"/\1/ig' | whatever_script.sh
>
> I didn't create this thread, but can you please explain that sed
> statement? I don't get it… (I'm not a beginner with regular
> expressions but I'm definitely not an expert either…)

It's attempting to capture the string in between SRC=" and the next doublequote: ". The [^"] stops the capture at the next double quote. The capture should then include any character that is NOT a double quote. If not careful, the expression could get "greedy" and start matching other double quotes on the same line. This should stop that effect. The \1 is a reference back to the capture that is in the parentheses, in sed syntax, which essentially just preserves the captured characters, and ignores the rest. Does that make sense?
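
A small sketch of the capture and the "greedy" effect described above, using a sample tag modelled on the one in the thread. The square brackets in the replacement are only there to make the captured text visible, and `-E` is used as the equivalent spelling of `-r`.

```shell
# One sample tag, modelled on the one quoted in the thread.
line='<IMG SRC="source_html_m1463afff.jpg" NAME="graphics3" ALIGN=BOTTOM>'

# [^"]+ stops at the first closing quote, so \1 holds only the file name.
bounded=$(printf '%s\n' "$line" | sed -E 's/SRC="([^"]+)"/[\1]/')

# .+ is greedy and runs on to the last double quote on the line.
greedy=$(printf '%s\n' "$line" | sed -E 's/SRC="(.+)"/[\1]/')

printf '%s\n%s\n' "$bounded" "$greedy"
```

The first result brackets just the file name; the second brackets everything up to the quote that closes the NAME attribute.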

--
Hal

08-10-2011, 10:43 PM
Jordon Bedwell

Scripting / one liner help

On Wed, August 10, 2011 2:52 pm, Hal Burgiss wrote:
> Its attempting to capture the string in between:
>
> SRC=" and the next doublequote: ". The [^"] stops the capture at the
> double quote. The capture should then include any character that is NOT a
> double quote. If not careful, the expression could get "greedy" and start
> matching other double quotes on the same line. This should stop that
> effect. The \1 is a reference back to the capture that is in the
> parenthesis, in sed syntax, which essentially just preserves the captured
> characters, and ignores the rest. Does that make sense?

Because it should be:

grep -iPo "<img[^>]+>" file.html |
sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI'

[COPY AND PASTE BOTH LINES AT ONCE AND PRESS THE ENTER KEY]


08-10-2011, 11:58 PM
Patton Echols

Scripting / one liner help

On 08/10/2011 02:52 PM, Hal Burgiss wrote:
> On Wed, Aug 10, 2011 at 3:00 PM, Johnny Rosenberg
> <gurus.knugum@gmail.com> wrote:
>
>> 2011/8/10 Hal Burgiss <hal@burgiss.net>:
>>> See if this gets close to extracting the image names ...
>>> grep SRC *html | sed -r 's/SRC="([^"]+)"/\1/ig' | whatever_script.sh
>>
>> I didn't create this thread, but can you please explain that sed
>> statement? I don't get it… (I'm not a beginner with regular
>> expressions but I'm definitely not an expert either…)
>
> It's attempting to capture the string in between:
>
> SRC=" and the next doublequote: ". The [^"] stops the capture at
> the next double quote. The capture should then include any character
> that is NOT a double quote. If not careful, the expression could get
> "greedy" and start matching other double quotes on the same line.
> This should stop that effect. The \1 is a reference back to the
> capture that is in the parentheses, in sed syntax, which essentially
> just preserves the captured characters, and ignores the rest. Does
> that make sense?

Thanks for the explanation Hal, unfortunately it is not doing the
"ignores the rest" part. It appears that it finds each occurrence of a
file name, then replaces it with the same occurrence, without the " marks.
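
The behaviour described here can be reproduced with a short sketch. It assumes a single sample tag per line; the second command is one possible way to drop the rest of the line and is not necessarily what anyone in the thread intended.

```shell
line='<IMG SRC="source_html_m1463afff.jpg" NAME="graphics3" BORDER=0>'

# Only the matched text is replaced; everything else on the line stays put.
partial=$(printf '%s\n' "$line" | sed -E 's/SRC="([^"]+)"/\1/')

# Letting the pattern match the whole line leaves only the capture behind.
whole=$(printf '%s\n' "$line" | sed -E 's/.*SRC="([^"]+)".*/\1/')

printf '%s\n%s\n' "$partial" "$whole"
```

With several tags on one line, the leading `.*` is greedy, so only the last SRC value on that line would survive the second command.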




08-11-2011, 12:06 AM
Patton Echols

Scripting / one liner help

On 08/10/2011 03:43 PM, Jordon Bedwell wrote:
> On Wed, August 10, 2011 2:52 pm, Hal Burgiss wrote:
>> It's attempting to capture the string in between:
>>
>> SRC=" and the next doublequote: ". The [^"] stops the capture at the
>> double quote. The capture should then include any character that is NOT a
>> double quote. If not careful, the expression could get "greedy" and start
>> matching other double quotes on the same line. This should stop that
>> effect. The \1 is a reference back to the capture that is in the
>> parentheses, in sed syntax, which essentially just preserves the captured
>> characters, and ignores the rest. Does that make sense?
>
> Because it should be:
>
> grep -iPo "<img[^>]+>" file.html |
> sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI'
>
> [COPY AND PASTE BOTH LINES AT ONCE AND PRESS THE ENTER KEY]


Thanks, that works great and solves the immediate problem. For purposes
of my CLE (continuing linux education) I hope you will indulge me in the
same question you posed to Hal. How's it work? I get the -io grep
tags. The -P enables perl regex? What part of the grep string is the
perl part?


Then I also wonder how the sed statement works. I am still trying to
figure sed (and plain old regex) out.


Even if you don't have time for the follow up, I appreciate it.

-- PE

08-11-2011, 01:30 AM
Hal Burgiss

Scripting / one liner help

On Wed, Aug 10, 2011 at 7:58 PM, Patton Echols <p.echols@comcast.net> wrote:

> On 08/10/2011 02:52 PM, Hal Burgiss wrote:
>
>> On Wed, Aug 10, 2011 at 3:00 PM, Johnny Rosenberg <gurus.knugum@gmail.com> wrote:
>>
>>> 2011/8/10 Hal Burgiss <hal@burgiss.net>:
>>>> See if this gets close to extracting the image names ...
>>>> grep SRC *html | sed -r 's/SRC="([^"]+)"/\1/ig' | whatever_script.sh
>>>
>>> I didn't create this thread, but can you please explain that sed
>>> statement? I don't get it… (I'm not a beginner with regular
>>> expressions but I'm definitely not an expert either…)
>
> Thanks for the explanation Hal, unfortunately it is not doing the "ignores the rest" part. It appears that it finds each occurrence of a file name, then replaces it with the same occurrence, without the " marks.



Sorry, something got left out, try ...

grep SRC *html | sed -r 's/.*SRC="
[The rest of this command is missing from the archived post.]


--
Hal

08-11-2011, 01:44 AM
Jordon Bedwell

Scripting / one liner help

On Wed, August 10, 2011 5:06 pm, Patton Echols wrote:
> On 08/10/2011 03:43 PM, Jordon Bedwell wrote:
>> On Wed, August 10, 2011 2:52 pm, Hal Burgiss wrote:
>>> Its attempting to capture the string in between:
>>>
>>> SRC=" and the next doublequote: ". The [^"] stops the capture at the
>>> double quote. The capture should then include any character that is NOT
>>> a
>>> double quote. If not careful, the expression could get "greedy" and
>>> start
>>> matching other double quotes on the same line. This should stop that
>>> effect. The \1 is a reference back to the capture that is in the
>>> parenthesis, in sed syntax, which essentially just preserves the
>>> captured
>>> characters, and ignores the rest. Does that make sense?
>> Because it should be:
>>
>> grep -iPo "<img[^>]+>" file.html |
>> sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI'
>>
>> [COPY AND PASTE BOTH LINES AT ONCE AND PRESS THE ENTER KEY]
>
> Thanks, that works great and solves the immediate problem. For purposes
> of my CLE (continuing linux education) I hope you will indulge me in the
> same question you posed to Hal. How's it work? I get the -io grep
> tags. The -P enables perl regex? What part of the grep string is the
> perl part.

BRE: grep -io "<img[^>]\+>" index.html. I chose Perl syntax by habit, not
by need. So to answer your question: the "+". For this, Perl and ERE are
the same. It won't be until later, when you start doing some hardcore
regexps, that you see the differences between ERE, Perl and others.
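
A quick sketch of the point about "+", assuming GNU grep and a made-up line holding two tags. The ERE form (-E) writes the quantifier as `+`, the basic form writes it as `\+`, and `-o` puts each match on its own line, which is what splits one long line of tags apart.

```shell
html='<P><IMG SRC="a.jpg" NAME="g1"><IMG SRC="b.jpg" NAME="g2"></P>'

# -o prints each match on its own line, so one long line becomes one tag per line.
ere=$(printf '%s\n' "$html" | grep -iEo '<img[^>]+>')

# In GNU grep's basic syntax the same quantifier is written \+.
bre=$(printf '%s\n' "$html" | grep -io '<img[^>]\+>')

printf '%s\n' "$ere"
```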

> Then I also wonder how the sed statement works. I am still trying to
> figure sed (and plain old regex) out.

'\'' is a bash escape for ' so you should read it without '\''. It's a BRE
so think \( is ( in ERE or Perl syntax. /g tells it to do it globally, not
only act on the first instance it finds and exit, and /I tells it to ignore
the case. \1 (\( \)) is a backreference, which should have been one of the
first things you learnt about regexps.
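
To make the '\'' part concrete, here is a minimal sketch that stores the sed script in a variable and prints what sed actually receives. The sample tag is made up. The sequence closes the single-quoted string, adds one escaped ', and reopens the quotes.

```shell
# '\'' ends the quoted string, adds a literal ', then starts quoting again.
script='s/<img src=['\''"]\([^"'\'']*\).*/\1/'

# What sed sees once the shell has removed the quoting.
printf '%s\n' "$script"

# Applied to a tag that uses single quotes around the attribute value.
out=$(printf '%s\n' "<img src='x.jpg' alt='y'>" | sed "$script")
printf '%s\n' "$out"
```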

Now on to the rest of it:
sed 's/<img src=['\''"]\([^"'\'']*\).*/\1/gI'
sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI'

At this point, for you, these two are the same and the choice is a matter of
preference; the latter is my own preference. They both do the same thing
right now for your usage. In later applications, where more advanced things
happen, you will start to notice the differences. To elaborate:

*IF* index.html was a FULL HTML page
*THEN:* sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI' 1.html > 1.txt
*IS:* image.jpg [Assuming <img /> is on its own line with no wrappers]
*AND:* sed 's/<img src=['\''"]\([^"'\'']*\).*/\1/Ig' 1.html > 1.txt
*IS:* the same index.html page with those substitutions applied.

Since I'm horrible at teaching, in other words: the first, with -n and /p, will
only show the backreferences in that example, and the second will replace
those lines in the output, leaving everything else intact. Do them both on
your file with > filename.txt and you will see what I mean instantly.
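
The difference can also be seen without touching any files. This sketch assumes GNU sed (for the I flag) and a made-up three-line page held in a variable.

```shell
page='<html>
<img src="image.jpg">
</html>'

# -n turns off automatic printing; the p flag prints only lines where a substitution happened.
only_matches=$(printf '%s\n' "$page" | sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI')

# Without -n every line is printed, changed or not.
all_lines=$(printf '%s\n' "$page" | sed 's/<img src=['\''"]\([^"'\'']*\).*/\1/gI')

printf '%s\n---\n%s\n' "$only_matches" "$all_lines"
```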

Somebody else might be better at explaining. I am a doer and an outputter,
not really a teacher. I can show you how to do a lot, but when it comes to
explaining how I did it you're barking up the wrong tree, because to me it
comes out as pro English and to you it comes out as gibberish. To me it
comes out as "this is how it's done" and to you it comes out as "what the
hell did he just say? he pretty much just said by voice the command and
gave no explanation of what it does" <<< Plenty have said that one to me.


All times are GMT.