08-11-2011, 03:45 AM
Patton Echols

Scripting / one liner help

On 08/10/2011 06:44 PM, Jordon Bedwell wrote:

On Wed, August 10, 2011 5:06 pm, Patton Echols wrote:

On 08/10/2011 03:43 PM, Jordon Bedwell wrote:

On Wed, August 10, 2011 2:52 pm, Hal Burgiss wrote:

It's attempting to capture the string between SRC=" and the next
double quote. The [^"] stops the capture at that double quote, so the
capture includes any character that is NOT a double quote. If not
careful, the expression could get "greedy" and start matching other
double quotes on the same line. This should stop that effect. The \1 is
a reference back to the capture in the parentheses, in sed syntax, which
essentially just preserves the captured characters and ignores the rest.
Does that make sense?

Because it should be:

grep -iPo "<img[^>]+>" file.html |
sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI'

[COPY AND PASTE BOTH LINES AT ONCE AND PRESS THE ENTER KEY]

Thanks, that works great and solves the immediate problem. For purposes
of my CLE (continuing linux education) I hope you will indulge me in the
same question you posed to Hal. How does it work? I get the -i and -o
grep flags. Does -P enable Perl regex? Which part of the grep string is
the Perl part?

BRE: grep -io "<img[^>]\+>" index.html. I chose Perl syntax by habit, not
by need. So to answer your question: the "+". For this pattern, Perl and
ERE are the same. It won't be until later, when you start doing some
hardcore regexps, that you see the differences between ERE, Perl and
others.
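
For anyone who wants to check the point about "+" themselves, here is a small sketch comparing the three regex flavours in grep. The sample line and the variable names are made up for illustration, and it assumes GNU grep built with PCRE support (needed for -P).

```shell
# Hypothetical sample input: two image tags on a single line.
html='<p><img src="a.png" alt="A"><IMG SRC="b.jpg"></p>'

# BRE (grep's default): "+" must be written "\+" (a GNU extension).
bre=$(printf '%s\n' "$html" | grep -io '<img[^>]\+>')

# ERE (-E): "+" is a quantifier without the backslash.
ere=$(printf '%s\n' "$html" | grep -Eio '<img[^>]+>')

# Perl-compatible (-P): same spelling as ERE for this pattern.
pcre=$(printf '%s\n' "$html" | grep -Pio '<img[^>]+>')

# All three print each tag on its own line.
printf '%s\n' "$bre"
```

The three variables should hold identical output; only the spelling of the quantifier changes between flavours.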


Then I also wonder how the sed statement works. I am still trying to
figure sed (and plain old regex) out.

'\'' is a bash escape for ' so you should read it as a plain '. It's a
BRE, so think of \( as ( in ERE or Perl syntax. /g tells it to replace
every match on a line, not only the first one it finds, and /I tells it
to ignore case. \1 (\n in general) is a backreference, which should have
been one of the first things you learnt about regexps.
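
A minimal sketch of the quoting and grouping points, assuming GNU sed (for the I flag and -r). The sample line and variable names are hypothetical. Printing the pattern shows what sed actually receives once the shell has processed the '\'' sequences.

```shell
# Hypothetical sample line with a single-quoted attribute value.
line="<img src='pic one.png' alt='x'>"

# '\'' = close the quoted string, add an escaped literal ', reopen it.
pat='s/<img src=['\''"]\([^"'\'']*\).*/\1/pI'

# Show the expression as sed sees it: the brackets now hold plain ' and ".
printf '%s\n' "$pat"

# BRE writes the capture group as \( \); ERE (-r) writes it as ( ).
bre=$(printf '%s\n' "$line" | sed -n "$pat")
ere=$(printf '%s\n' "$line" | sed -n -r 's/<img src=['\''"]([^"'\'']*).*/\1/pI')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands capture everything up to the next quote character, so either quote style in the HTML is handled.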

Now on to the rest of it:
sed 's/<img src=['\''"]\([^"'\'']*\).*/\1/gI'
sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI'

At this point, for your usage, these two do the same thing, and choosing
between them is a matter of preference; I prefer the latter. In later
applications, where more advanced things happen, you will start to notice
the differences. To elaborate:

IF index.html was a FULL HTML page
THEN: sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI' 1.html > 1.txt
IS: image.jpg [assuming <img /> is on its own line with no wrappers]
AND: sed 's/<img src=['\''"]\([^"'\'']*\).*/\1/Ig' 1.html > 1.txt
IS: the same index.html page with those changes applied.

Since I'm horrible at teaching, in other words: the first, with -n and /p,
will only show the backreferences in that example, and the second will
replace those lines in the output, leaving everything else intact. Run
them both on your file with > filename.txt and you will see what I mean
instantly.
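
To make the -n and /p difference concrete, here is a rough sketch on a made-up three-line page. The page content and the variable names are assumptions for illustration only, and GNU sed is assumed for the I flag.

```shell
# Hypothetical three-line page; only the middle line holds an <img> tag.
page='<html>
<img src="image.jpg" alt="x">
</html>'

# The substitution without its flags; flags are appended below.
expr='s/<img src=['\''"]\([^"'\'']*\).*/\1/'

# -n plus the p flag: print only lines where a substitution happened.
only=$(printf '%s\n' "$page" | sed -n "${expr}pgI")

# No -n, no p: every line is printed, matching lines are rewritten.
all=$(printf '%s\n' "$page" | sed "${expr}gI")

printf 'with -n and p:\n%s\n\nwithout:\n%s\n' "$only" "$all"
```

The first form yields just image.jpg; the second yields the whole page with the tag line replaced by image.jpg.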

Somebody else might be better at explaining. I am a doer and an outputter,
not really a teacher. I can show you how to do a lot, but when it comes to
explaining how I did it you're barking up the wrong tree, because to me it
comes out as plain English and to you it comes out as gibberish. To me it
comes out as "this is how it's done" and to you it comes out as "what the
hell did he just say? He pretty much just read the command aloud and gave
no explanation of what it does". Plenty have said that one to me.


This is great. Thanks so much for taking the time. True, it is a
little opaque to me right now, but I also know how I learn. And there
is enough there that I can figure it out with some work. That's how I
really learn best. So I thank you.


--PE

--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 
08-11-2011, 08:35 AM
Johnny Rosenberg

2011/8/10 Hal Burgiss <hal@burgiss.net>:
>
> On Wed, Aug 10, 2011 at 3:00 PM, Johnny Rosenberg <gurus.knugum@gmail.com>
> wrote:
>>
>> 2011/8/10 Hal Burgiss <hal@burgiss.net>:
>> >
>> > See if this gets close to extracting the image names ...
>> > grep SRC *html | sed -r 's/SRC="([^"]+)"/\1/ig' | whatever_script.sh
>>
>> I didn't create this thread, but can you please explain that sed
>> statement? I don't get it… (I'm not a beginner with regular
>> expressions but I'm definitely not an expert either…)
>>
>
> It's attempting to capture the string between SRC=" and the next
> double quote. The [^"] stops the capture at that double quote, so the
> capture includes any character that is NOT a double quote. If not
> careful, the expression could get "greedy" and start matching other
> double quotes on the same line. This should stop that effect. The \1 is
> a reference back to the capture in the parentheses, in sed syntax, which
> essentially just preserves the captured characters and ignores the
> rest. Does that make sense?

Aaaah…! Thanks! I always forget that ^ means NOT in some situations.
That has happened to me before (I should learn some time, shouldn't I?)…!
I just didn't get it when I thought of ^ in its other meaning…
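
For readers who trip over the same thing, a small sketch showing both meanings of ^ in a single expression. The sample line and variable names are made up, and GNU sed is assumed for -r.

```shell
# Hypothetical sample line.
line='SRC="logo.png" ALT="site logo"'

# The first ^ anchors the match at the start of the line.
# The ^ inside [...] negates the set: [^"]+ is "one or more non-quotes".
value=$(printf '%s\n' "$line" | sed -r 's/^SRC="([^"]+)".*/\1/')

# Same expression on a line that does not START with SRC:
# the anchor prevents a match, so the line passes through unchanged.
other=$(printf '%s\n' "x $line" | sed -r 's/^SRC="([^"]+)".*/\1/')

printf '%s\n%s\n' "$value" "$other"
```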


Best regards

Johnny Rosenberg
ジョニー・ローゼンバーグ

> --
> Hal
>
 
08-11-2011, 09:05 PM
Patton Echols

On 08/10/2011 06:30 PM, Hal Burgiss wrote:
On Wed, Aug 10, 2011 at 7:58 PM, Patton Echols <p.echols@comcast.net> wrote:

On 08/10/2011 02:52 PM, Hal Burgiss wrote:

On Wed, Aug 10, 2011 at 3:00 PM, Johnny Rosenberg <gurus.knugum@gmail.com> wrote:

2011/8/10 Hal Burgiss <hal@burgiss.net>:

>
> See if this gets close to extracting the image names ...
> grep SRC *html | sed -r 's/SRC="([^"]+)"/\1/ig' | whatever_script.sh

I didn't create this thread, but can you please explain that sed
statement? I don't get it… (I'm not a beginner with regular
expressions but I'm definitely not an expert either…)


Thanks for the explanation, Hal. Unfortunately it is not doing the
"ignores the rest" part. It appears that it finds each occurrence
of a file name, then replaces it with the same occurrence, without
the " marks.


Sorry something got left out, try ...

grep SRC *html | sed -r 's/.*SRC="([^"]+)".*/\1/ig'


--
Hal
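
For anyone following along, here is a small sketch of how the two expressions in this exchange differ. The sample line and the variable names are made up for illustration, and GNU sed is assumed (for -r and the i flag).

```shell
# Hypothetical input line with text on both sides of the attribute.
line='<IMG SRC="a.png" ALT="A">'

# Only SRC="..." is matched, so only that part is replaced by \1;
# the rest of the line is left where it was.
partial=$(printf '%s\n' "$line" | sed -r 's/SRC="([^"]+)"/\1/ig')

# Wrapping the pattern in .* makes the match cover the whole line,
# so the whole line is replaced by the captured file name.
whole=$(printf '%s\n' "$line" | sed -r 's/.*SRC="([^"]+)".*/\1/ig')

printf '%s\n%s\n' "$partial" "$whole"
```

The first result still contains the surrounding markup with the quotes stripped; the second is the bare file name.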


As mentioned in another post to this thread, I have a working solution.
So this is for info only.


The original source document has all the image tags on one line w/o
carriage return or newline. So the grep statement captures the whole
line. Then the modified sed statement outputs only the last image file
name.
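
A rough sketch of why only the last name comes out, assuming GNU sed. The two-tag sample line and the variable name are hypothetical.

```shell
# Hypothetical line with two image tags and no newline between them.
line='<IMG SRC="first.png"><IMG SRC="second.png">'

# The leading .* is greedy: it swallows as much as it can while still
# letting the rest of the pattern match, so \1 ends up being the LAST
# file name on the line, and the whole line is replaced by it.
last=$(printf '%s\n' "$line" | sed -r 's/.*SRC="([^"]+)".*/\1/ig')

printf '%s\n' "$last"
```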


Using the grep statement suggested by Johnny:

grep -io "<img[^>]\+>"

solves it because grep is spitting out each match, not the entire line.
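
Put together, the working approach can be sketched like this, assuming GNU grep and GNU sed. The sample line and the variable name are made up; the sed expression is the one from earlier in the thread with its backslashes written out.

```shell
# Hypothetical line with two image tags and no newline between them.
line='<IMG SRC="first.png"><img src="second.png" alt="2">'

# grep -o prints each matching tag on its own line, so sed then sees
# one tag per line and can reduce each to its file name.
names=$(printf '%s\n' "$line" |
  grep -io '<img[^>]\+>' |
  sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pI')

printf '%s\n' "$names"
```

Note that this sed expression expects src to be the first attribute directly after <img, which holds for the sample above but not for every page.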

As I mentioned to Johnny, even though I don't understand all of this, the discussion is helping me learn, so I greatly appreciate it.

--PE


 
08-11-2011, 09:33 PM
Hal Burgiss

On Thu, Aug 11, 2011 at 5:05 PM, Patton Echols <p.echols@comcast.net> wrote:



The original source document has all the image tags on one line w/o carriage return or newline. So the grep statement captures the whole line. Then the modified sed statement outputs only the last image file name.



Yea sorry, I didn't read something right ... I thought there were multiple files with one img tag per file.

--
Hal

 
08-12-2011, 11:45 AM
Johnny Rosenberg

2011/8/11 Patton Echols <p.echols@comcast.net>:
> Using the grep statement suggested by Johnny:
>
> grep -io "<img[^>]\+>"
>
> solves it because grep is spitting out each match, not the entire line.

Which Johnny was that? I can't see that suggestion in this
conversation, however Jordon suggested:
grep -iPo "<img[^>]+>" file.html |
sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI'

I only asked about another suggestion, I didn't suggest anything
myself. Don't want to take credit for other people's work…


Kind regards

Johnny Rosenberg
ジョニー・ローゼンバーグ

 
08-12-2011, 02:15 PM
Patton Echols

On 08/12/2011 04:45 AM, Johnny Rosenberg wrote:

2011/8/11 Patton Echols<p.echols@comcast.net>:

Using the grep statement suggested by Johnny:

grep -io "<img[^>]\+>"

solves it because grep is spitting out each match, not the entire line.

Which Johnny was that? I can't see that suggestion in this
conversation, however Jordon suggested:
grep -iPo "<img[^>]+>" file.html |
sed -n 's/<img src=['\''"]\([^"'\'']*\).*/\1/pgI'

I only asked about another suggestion, I didn't suggest anything
myself. Don't want to take credit for other people's work…



Oops, I meant Jordon. Sorry to you and especially to Jordon.
