01-31-2010, 01:54 AM
Zhang Weiwu

remove an HTML tag and all its children from commandline

Hello. I believe this is a common case and must have been discussed
before on other forums, such as awk/sed/regular-expression groups.
However, I could not find those discussions with Google. You would help
me a lot by simply pointing me to a reference for a solution.

I want to remove all advertisements in my 100 html files. They are
pretty neatly classed, like the following:

<div class="advertisement">
...
</div>

However, I cannot simply do this:
s/<div class="advertisement">.*<\/div>//

That is too greedy: it matches up to the last "</div>", which is almost
always well past the end of the advertisement.

If I make it non-greedy, it also fails, because it stops at the
first </div> inside the advertisement.

Consider this case, where both greedy and non-greedy matching fail:

<div class="page-content">
<div class="advertisement">
<div>Our product is the best</div>
<div>Contact us now!</div>
</div>
</div>

Greedy output:

<div class="page-content">

Non-greedy output:

<div class="page-content">
<div>Contact us now!</div>
</div>
</div>


Expected output:

<div class="page-content">
</div>
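
The two failure modes above can be reproduced in a few lines. This is a minimal sketch in Python rather than sed; the names SAMPLE and OPEN are made up for illustration, and re.DOTALL is used so that "." can cross line breaks, which a multi-line match needs in any case.

```python
import re

# The example page from the post, as one multi-line string.
SAMPLE = """<div class="page-content">
<div class="advertisement">
<div>Our product is the best</div>
<div>Contact us now!</div>
</div>
</div>
"""

OPEN = r'<div class="advertisement">'

# Greedy: ".*" runs to the LAST </div>, swallowing the page-content close tag.
greedy = re.sub(OPEN + r'.*</div>', '', SAMPLE, flags=re.DOTALL)

# Non-greedy: ".*?" stops at the FIRST </div>, which is still inside the ad.
non_greedy = re.sub(OPEN + r'.*?</div>', '', SAMPLE, flags=re.DOTALL)

print(repr(greedy))
print(repr(non_greedy))
```

Neither result equals the expected output, because a plain regular expression has no notion of how deeply the divs are nested.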

The only way to get this right seems to be to give the replace/remove
expression the ability to "count" the <div and </div> tags it
encounters. I could program such a thing in C, thanks to my college
education, but that sounds like overkill for such a common task. What
would you do in this case?
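
The "counting" idea does not need C. Below is a minimal sketch in Python, under stated assumptions: tags are plain <div ...> and </div> (no self-closing <div/>, no div-like text inside comments or scripts), and the class attribute is exactly the given name. The names remove_divs_with_class, DIV_TAG and SAMPLE are hypothetical, not from any library. Everything outside the removed block is copied through unchanged, so the rest of the markup is not normalized.

```python
import re

# Matches any opening <div ...> or closing </div> tag; group 1 is "/" for a close.
DIV_TAG = re.compile(r'<(/?)div\b[^>]*>', re.IGNORECASE)

def remove_divs_with_class(html, class_name):
    """Cut every <div class="class_name"> together with all of its children."""
    target = re.compile(
        r'<div\b[^>]*\bclass\s*=\s*["\']' + re.escape(class_name) + r'["\'][^>]*>',
        re.IGNORECASE,
    )
    out = []
    pos = 0     # start of the text that has not been copied to the output yet
    depth = 0   # 0 = outside an ad; >0 = number of unclosed divs inside one
    for m in DIV_TAG.finditer(html):
        if depth == 0:
            if target.fullmatch(m.group(0)):
                out.append(html[pos:m.start()])  # keep everything before the ad
                pos = m.start()                  # remember where the ad begins
                depth = 1
        elif m.group(1):                         # a closing </div> inside the ad
            depth -= 1
            if depth == 0:
                pos = m.end()                    # resume right after the ad
        else:                                    # a nested opening <div>
            depth += 1
    out.append(html[pos:])  # the tail; an ad that never closes is kept as-is
    return ''.join(out)

SAMPLE = """<div class="page-content">
<div class="advertisement">
<div>Our product is the best</div>
<div>Contact us now!</div>
</div>
</div>
"""

print(remove_divs_with_class(SAMPLE, 'advertisement'))
```

On the example page this leaves the page-content div with an empty line where the advertisement used to be. Real-world HTML can break the assumptions above, so it is worth diffing a few files by hand before running it over all of them.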


--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
 
01-31-2010, 02:48 AM
T o n g

On Sun, 31 Jan 2010 10:54:46 +0800, Zhang Weiwu wrote:

> I want to remove all advertisements in my 100 html files. They are
> pretty neatly classed, like the following:
>
> <div class="advertisement">
> ...
> </div>
>
> However, I cannot simply do this:
> s/<div class="advertisement">.*<\/div>//
>
> That is too greedy

For not-so-simple tasks, you need not-so-simple tools. Depending on how
much time you'd like to invest in such not-so-simple tools, take a
look at libwww, sgrep, or the XPath language.

HTH

--
Tong (remove underscore(s) to reply)
http://xpt.sourceforge.net/techdocs/
http://xpt.sourceforge.net/tools/


 
01-31-2010, 03:55 AM
Celejar

On Sun, 31 Jan 2010 10:54:46 +0800
Zhang Weiwu <zhangweiwu@realss.com> wrote:

...

> I want to remove all advertisements in my 100 html files. They are
> pretty neatly classed, like the following:
>
> <div class="advertisement">
> ...
> </div>
>
> However, I cannot simply do this:
> s/<div class="advertisement">.*<\/div>//
>
> That is too greedy: it matches up to the last "</div>", which is almost
> always well past the end of the advertisement.
>
> If I make it non-greedy, it also fails, because it stops at the
> first </div> inside the advertisement.

...

> The only way to get this right seems to be to give the replace/remove
> expression the ability to "count" the <div and </div> tags it
> encounters. I could program such a thing in C, thanks to my college
> education, but that sounds like overkill for such a common task. What
> would you do in this case?

"Among programmers of any experience, it is generally regarded as A Bad
Idea™ to attempt to parse HTML with regular expressions. How bad of an
idea? It apparently drove one Stack Overflow user to the brink of
madness:

"You can't parse [X]HTML with regex. Because HTML can't be parsed by
regex. Regex is not a tool that can be used to correctly parse HTML. As
I have answered in HTML-and-regex questions here so many times before,
the use of regex will not allow you to consume HTML.

Regular expressions are a tool that is insufficiently sophisticated to
understand the constructs employed by HTML. HTML is not a regular
language and hence cannot be parsed by regular expressions. Regex
queries are not equipped to break down HTML into its meaningful parts.
so many times but it is not getting to me. Even enhanced irregular
regular expressions as used by Perl are not up to the task of parsing
HTML. You will never make me crack. HTML is a language of sufficient
complexity that it cannot be parsed by regular expressions.

Even Jon Skeet cannot parse HTML using regular expressions. Every time
you attempt to parse HTML with regular expressions, the unholy child
weeps the blood of virgins, and Russian hackers pwn your webapp.
Parsing HTML with regex summons tainted souls into the realm of the
living. HTML and regex go together like love, marriage, and ritual
infanticide. The <center> cannot hold it is too late. The force of
regex and HTML together in the same conceptual space will destroy your
mind like so much watery putty. If you parse HTML with regex you are
giving in to Them and their blasphemous ways which doom us all to
inhuman toil for the One whose Name cannot be expressed in the Basic
Multilingual Plane, he comes."

That's right, if you attempt to parse HTML with regular expressions,
you're succumbing to the temptations of the dark god Cthulhu's … er …
code."

http://www.codinghorror.com/blog/archives/001311.html

Read on for more detail, and the Right Way to do this.

Celejar
--
foffl.sourceforge.net - Feeds OFFLine, an offline RSS/Atom aggregator
mailmin.sourceforge.net - remote access via secure (OpenPGP) email
ssuds.sourceforge.net - A Simple Sudoku Solver and Generator


 
01-31-2010, 09:20 AM
Zhang Weiwu

T o n g wrote:
> For not-so-simple tasks, you need not-so-simple tools. Depending on how
> much time you'd like to invest in such not-so-simple tools, take a
> look at libwww, sgrep, or the XPath language.
>
Sure. libwww and sgrep are tools, while XPath is a language. I believe I
should try XPath because I might use it in other places too, but what
tool should I use for XPath? Is there a handy command-line tool for it?

What worries me a bit about XPath is this: if the tool normalizes the
HTML, corrects its errors, or formats it differently in the output
after I have done the removal, that would be a big problem for me. I
am one link in a corporate workflow chain where others rely on poorly
made tools and on incorrect, messy HTML to do their daily work, and I
must not break those tools by improving the HTML, if I want to keep my
current peaceful and lazy life and save time for more valuable, sane
projects.

I am pretty sure sgrep can solve my problem, after glancing at the
manual, though.


 
01-31-2010, 09:31 AM
Steve Kemp

On Sun Jan 31, 2010 at 10:54:46 +0800, Zhang Weiwu wrote:

> I want to remove all advertisements in my 100 html files. They are
> pretty neatly classed, like the following:
>
> <div class="advertisement">
> ...
> </div>

You might enjoy my "html-tool" command which would do the
job for you via:

html-tool --cut-class=advertisement --file input.html

You can get it via:

wget http://mybin.repository.steve.org.uk/raw-file/tip/html-tool

Or via the repository at:

http://mybin.repository.steve.org.uk/

See here for some brief discussion:

http://blog.steve.org.uk/oh__this_should_be_stunning_.html

Internally it uses the XPath Perl module HTML::TreeBuilder::XPath,
but the details probably don't matter.

Steve
--


 
01-31-2010, 09:45 AM
Zhang Weiwu

Steve Kemp wrote:
>
> You might enjoy my "html-tool" command which would do the
> job for you via:
>
Thank you very much for mentioning this tool. At first glance it seems
just wonderful, designed precisely to solve problems like mine. However,
after I tried it, what I worried about most happened:
> What worries me a bit about XPath is this: if the tool normalizes the
> HTML, corrects its errors, or formats it differently in the output
> after I have done the removal, that would be a big problem for me. I
> am one link in a corporate workflow chain where others rely on poorly
> made tools and on incorrect, messy HTML to do their daily work, and I
> must not break those tools by improving the HTML, if I want to keep my
> current peaceful and lazy life and save time for more valuable, sane
> projects.
Unfortunately, it does. The output HTML no longer works with the stupid
drag-and-drop-html-editor-for-idiots that my "web design guy" is using.
I am in the position of delivering on a signed contract, not of
evaluating whether a contract can be done, so I cannot take html-tool as
an option. But I will certainly keep it in mind to use when feasible!

As time is tight, I guess I will just use the crudest solution: adding
the following to all HTML pages:

<style type="text/css">
.advertisement {
display: none;
}
</style>

It is a silly solution that punishes web visitors for the web designer's
fault. On the other hand, I think the web designer who made the junk
HTML really should not enjoy too much help from me. Maybe I will just
let it go this way.
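
Adding that style block to every page can itself be scripted without touching the rest of the markup. Below is a minimal sketch in Python, assuming each page contains a literal </head> tag; inject_style, STYLE_BLOCK and HEAD_CLOSE are hypothetical names chosen for illustration. The function inserts the block just before the first </head> and leaves every other character as it was.

```python
import re

# The style block from the post, inserted verbatim.
STYLE_BLOCK = """<style type="text/css">
.advertisement {
display: none;
}
</style>
"""

# Assumption: pages close their head with </head>, in any letter case.
HEAD_CLOSE = re.compile(r'</head\s*>', re.IGNORECASE)

def inject_style(html, style=STYLE_BLOCK):
    """Insert the style block before the first </head>; change nothing else."""
    if style in html:
        return html          # already injected: do not add it twice
    m = HEAD_CLOSE.search(html)
    if m is None:
        return html          # no </head> found: leave the page untouched
    return html[:m.start()] + style + html[m.start():]

page = '<html><head><title>t</title></head><body></body></html>'
print(inject_style(page))
```

When applying this to files, reading and writing with the same encoding and with newline='' should keep the original line endings intact, which matters if downstream tools are sensitive to byte-level changes.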


 
01-31-2010, 11:05 AM
Zhang Weiwu

Zhang Weiwu wrote:
> Sure. libwww and sgrep are tools, while XPath is a language. I believe I
> should try XPath because I might use it in other places too, but what
> tool should I use for XPath?
Now I think I can answer my own question, at least partly. There is a
good tool for XPath that is itself named xpath. In Debian it is in this
package:
$ apt-file search /usr/bin/xpath
libxml-xpath-perl: /usr/bin/xpath

An example of using the tool, printing the "advertisement" div:

$ tidy -q -asxml -utf8 page_07_zh.html | xpath -e '//div[@class="advertisement"]'
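
The command above only prints the matching element. Removing it needs one more step. Below is a minimal sketch using Python's standard xml.etree.ElementTree, which supports a small XPath subset. It assumes the input is already well-formed XML and, unlike real tidy -asxml output, carries no XHTML namespace or HTML entities; drop_ads and XHTML are hypothetical names. Note that the serializer rewrites the markup, so this does not address the concern about output that must stay byte-for-byte compatible with other tools.

```python
import xml.etree.ElementTree as ET

# A well-formed, namespace-free stand-in for a tidied page (an assumption).
XHTML = """<html><body>
<div class="page-content">
<div class="advertisement">
<div>Our product is the best</div>
<div>Contact us now!</div>
</div>
<p>Real content</p>
</div>
</body></html>"""

def drop_ads(xml_text):
    """Remove every <div class="advertisement"> and all of its children."""
    root = ET.fromstring(xml_text)
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # Same XPath as in the xpath command, written relative to the root element.
    for ad in root.findall('.//div[@class="advertisement"]'):
        parent_of[ad].remove(ad)
    return ET.tostring(root, encoding='unicode')

print(drop_ads(XHTML))
```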


 
01-31-2010, 05:07 PM
T o n g

On Sun, 31 Jan 2010 20:05:46 +0800, Zhang Weiwu wrote:

> $ tidy -q -asxml -utf8 page_07_zh.html | xpath -e
> '//div[@class="advertisement"]'

Exactly. Glad that you found both tidy and libxml-xpath-perl, and solved
the problem yourself.

--
Tong (remove underscore(s) to reply)
http://xpt.sourceforge.net/techdocs/
http://xpt.sourceforge.net/tools/


 
