08-06-2008, 02:35 AM
"John Toliver"

Off-Topic: Parse an html file and transfer the text found

I have a CD which came with a textbook I use for school. The CD is a
list of commonly prescribed drugs. I am entering these drugs one by
one into a database I've created. I thought about it and since the
files are html files called in a number of ways via what looks like
javascript, I was thinking that I could build a script using some
language, maybe PERL or python and program it to parse the html file
and transfer it to my hsqldb, and place the information into the
proper fields in the database.

So my question to start is which language should I use to pull the
data out of an html file? Is perl better for this application, or is
python better or some other language?

I'm probably going to need to brush up on my regular expressions for
this but that's ok too.

Any thoughts would be appreciated...

--
I've discovered the key to success is to never give up. You either
learn the right way, or you run out of ways to do it wrong. A win/win
situation!

--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 
08-06-2008, 08:08 AM
Markus Schönhaber

Off-Topic: Parse an html file and transfer the text found

John Toliver wrote:

> So my question to start is which language should I use to pull the
> data out of an html file?

The one that you're familiar with is, IMO, the primary choice.

> Is perl better for this application, or is
> python better or some other language?

I'm not too familiar with Perl but have done quite a bit of Python
programming over the years. I therefore don't have an unbiased view on
this; nevertheless, I doubt that one has a massive advantage over
the other when it comes to text processing.

> I'm probably going to need to brush up on my regular expressions for
> this but that's ok too.
>
> Any thoughts would be appreciated...

To extract data from HTML, there are two ways to approach the problem that
seem obvious to me:
1. See HTML as text.
2. See HTML as structured data.

In the first case, you could use REs (regular expressions) to extract the
wanted data. To me, it seems that this is what you have in mind.
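
A minimal Python sketch of this first approach. The markup in `SAMPLE` (an `<h2>` name followed by a `<p class="dose">` paragraph) and the name `extract_entries` are assumptions made up for illustration; the files on the CD are unknown, so the pattern would have to be adapted to them.

```python
import re

# Hypothetical sample; the markup on the actual CD is unknown.
SAMPLE = """
<h2>Amoxicillin</h2><p class="dose">500 mg</p>
<h2>Ibuprofen</h2>
<p class="dose">200 mg</p>
"""

# Non-greedy groups capture the text between the tags; \s* tolerates
# whitespace or a newline between the heading and the paragraph.
ENTRY = re.compile(r'<h2>(.*?)</h2>\s*<p class="dose">(.*?)</p>', re.DOTALL)


def extract_entries(html):
    """Return a list of (name, dose) tuples found in the HTML text."""
    return [(name.strip(), dose.strip()) for name, dose in ENTRY.findall(html)]


if __name__ == "__main__":
    for name, dose in extract_entries(SAMPLE):
        print(name, "|", dose)
```

This works only as long as the markup is very regular; nested tags or attributes in a different order would break the pattern.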

In the second case, you could use an appropriate parser that helps you
navigate the document and access the wanted data.
For example: depending on the quality of the HTML document, it might
already be well-formed XML (or could easily be converted to it using
something like HTML Tidy). You could then load it with an XML parser and
use its methods to navigate to the data you're interested in.
You could even use XSLT to print out the desired SQL statements and do
no Python/Perl/whatever programming at all.
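
A rough Python sketch of the structured approach, using the standard library's `xml.etree.ElementTree` on well-formed input and printing SQL statements. The `<div class="drug">` layout, the table name `drugs`, and its columns `name` and `dose` are hypothetical; nothing in the thread says what the files or the database schema look like.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment; real files would be read from disk.
# Note: real XHTML usually declares a namespace, in which case tag names
# become "{http://www.w3.org/1999/xhtml}div" and so on.
SAMPLE = """<html><body>
<div class="drug"><h2>Amoxicillin</h2><p class="dose">500 mg</p></div>
<div class="drug"><h2>O'Brien's Syrup</h2><p class="dose">5 ml</p></div>
</body></html>"""


def sql_quote(value):
    """Quote a string as an SQL literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def to_insert_statements(xhtml, table="drugs"):
    """Walk the parsed tree and build one INSERT statement per drug entry."""
    root = ET.fromstring(xhtml)
    statements = []
    for div in root.iter("div"):
        if div.get("class") != "drug":
            continue
        name = div.findtext("h2", default="").strip()
        dose = div.findtext("p", default="").strip()
        statements.append(
            "INSERT INTO {} (name, dose) VALUES ({}, {});".format(
                table, sql_quote(name), sql_quote(dose)))
    return statements


if __name__ == "__main__":
    print("\n".join(to_insert_statements(SAMPLE)))
```

When inserting through a database driver instead of generating a script, bound parameters would be preferable to manual quoting.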

To sum things up: IMO there is no single best way and no single best
programming language to get the desired result. What's best for you
largely depends on what you're familiar with and what matches your
personal preference best.

Regards
mks

 
08-06-2008, 09:22 AM
Leo Cacciari

Off-Topic: Parse an html file and transfer the text found

On Wed, 06/08/2008 at 10.08 +0200, Markus Schönhaber wrote:
> John Toliver wrote:
>
> > So my question to start is which language should I use to pull the
> > data out of an html file?
>
> The one that you're familiar with is, IMO, the primary choice.
>
> > Is perl better for this application, or is
> > python better or some other language?
>
> I'm not too familiar with Perl but have done quite a bit of Python
> programming over the years. I therefore don't have an unbiased view on
> this; nevertheless, I doubt that one has a massive advantage over
> the other when it comes to text processing.
>

Well, I tend to disagree, but then I'm Perl-biased, so maybe my
advice should be taken "cum grano salis".

> > I'm probably going to need to brush up on my regular expressions for
> > this but that's ok too.
> >
> > Any thoughts would be appreciated...
>
There is a wonderful book on REs in the O'Reilly series that explains how
to use them in different languages: "Mastering Regular Expressions" by
Jeffrey Friedl.

If you decide on Perl (not PERL, which is another thing...), you may
find the HTML::Tree module useful
(http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Tree.pm).



> ...snip....
> To sum things up: IMO there is no single best way and no single best
> programming language to get the desired result. What's best for you
> largely depends on what you're familiar with and what matches your
> personal preference best.
>

And this is nothing but the truth

Enjoy
--
Leo Cacciari

 
08-06-2008, 02:37 PM
Derek Broughton

Off-Topic: Parse an html file and transfer the text found

John Toliver wrote:

> So my question to start is which language should I use to pull the
> data out of an html file? Is perl better for this application, or is
> python better or some other language?

Yes :-) Any language you're comfortable with that has tools for parsing
HTML would be good. If the files are guaranteed valid XHTML, you probably
have even more choices, but certainly Perl or Python should be fine, and
I'd use Python.
>
> I'm probably going to need to brush up on my regular expressions for
> this but that's ok too.

That's why it's easier if they're XHTML: the files should then parse
with an XML parser, which makes it really easy to extract the meaningful
data.
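
One way to find out which situation applies is simply to try an XML parser on each file. A small Python sketch; the function name `is_well_formed_xml` is made up for illustration, and it only checks well-formedness, not validity against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Properly nested and closed: an XML parser accepts it.
    print(is_well_formed_xml("<html><body><p>ok</p></body></html>"))
    # Typical tag soup: <br> is never closed, so XML parsing fails.
    print(is_well_formed_xml("<html><body><p>line<br></p></body></html>"))
```

Files that fail this check would need a more tolerant HTML parser, or a cleanup pass first.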
--
derek


 
08-06-2008, 03:18 PM
"John Toliver"

Off-Topic: Parse an html file and transfer the text found

I want to send a pastebin because I think it's html with javascript
embedded, but I'm not sure......

On Wed, Aug 6, 2008 at 10:37, Derek Broughton <news@pointerstop.ca> wrote:
> John Toliver wrote:
>
>> So my question to start is which language should I use to pull the
>> data out of an html file? Is perl better for this application, or is
>> python better or some other language?
>
> Yes :-) Any language you're comfortable with that has tools for parsing
> HTML would be good. If the files are guaranteed valid XHTML, you probably
> have even more choices, but certainly Perl or Python should be fine, and
> I'd use Python.
>>
>> I'm probably going to need to brush up on my regular expressions for
>> this but that's ok too.
>
> That's why if they're XHTML, it's easier - because then the files should
> parse with an XML parser and be really easy to extract the meaningful data
> from.
> --
> derek

--
I've discovered the key to success is to never give up. You either
learn the right way, or you run out of ways to do it wrong. A win/win
situation!

 
08-06-2008, 03:23 PM
"Bill Walton"

Off-Topic: Parse an html file and transfer the text found

Hi John,

John Toliver wrote:

> So my question to start is which language should I use to pull the
> data out of an html file? Is perl better for this application, or is
> python better or some other language?

My personal preference is Ruby, but any language with an XML parser would
do. It's just that Ruby makes it enjoyable ;-)

Best regards,
Bill


 
08-06-2008, 03:40 PM
Leo Cacciari

Off-Topic: Parse an html file and transfer the text found

On Wed, 06/08/2008 at 10.23 -0500, Bill Walton wrote:
> Hi John,
>
> John Toliver wrote:
>
> > So my question to start is which language should I use to pull the
> > data out of an html file? Is perl better for this application, or is
> > python better or some other language?
>
> My personal preference is Ruby, but any language with an XML parser would
> do. It's just that Ruby makes it enjoyable ;-)
>
Both the Python lover and I have gone out of our way to avoid
religious wars about scripting languages (which are almost as bad as
the ones about emacs/vi, and as useful as a sore finger when you have
something to type...). Please do the same :-)

Enjoy
--

Leo Cacciari
 
08-06-2008, 03:46 PM
Leo Cacciari

Off-Topic: Parse an html file and transfer the text found

On Wed, 06/08/2008 at 11.18 -0400, John Toliver wrote:
> I want to send a pastebin because I think it's html with javascript
> embedded, but I'm not sure......
Please try not to top-post...

It all depends on what the javascript is for. If it is some REST thing,
then you have a problem, as the "visible" content of the page would
depend on the interaction of those REST components with the server,
and parsing the html+javascript will lead you nowhere.

On the other hand, if the javascript is only there to produce some visual
effect, without adding to the data you are interested in, then it is
easy to eliminate it at parsing time.
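
As an illustration of dropping the javascript at parsing time, here is a sketch in Python rather than Perl, using the standard library's `html.parser`. The class name `TextWithoutScripts` and the helper `visible_text` are made up for this example; the approach only helps when the data is present in the HTML itself.

```python
from html.parser import HTMLParser


class TextWithoutScripts(HTMLParser):
    """Collect visible text while skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the list of text chunks found outside script/style elements."""
    parser = TextWithoutScripts()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    page = '<p>Aspirin</p><script>var x = "<b>fake</b>";</script><p>100 mg</p>'
    print(visible_text(page))
```

If the javascript builds the content itself, for example by writing the drug data into the page, this skips exactly the part that matters, and the script would have to be inspected instead.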

Enjoy

--
Leo Cacciari

 
08-06-2008, 04:16 PM
NoOp

Off-Topic: Parse an html file and transfer the text found

On 08/05/2008 07:35 PM, John Toliver wrote:
> I have a CD which came with a textbook I use for school. The CD is a
> list of commonly prescribed drugs. I am entering these drugs one by
> one into a database I've created. I thought about it and since the
> files are html files called in a number of ways via what looks like
> javascript, I was thinking that I could build a script using some
> language, maybe PERL or python and program it to parse the html file
> and transfer it to my hsqldb, and place the information into the
> proper fields in the database.
>
> So my question to start is which language should I use to pull the
> data out of an html file? Is perl better for this application, or is
> python better or some other language?
>
> I'm probably going to need to brush up on my regular expressions for
> this but that's ok too.
>
> Any thoughts would be appreciated...
>

Have you tried opening the html files in Calc (OpenOffice.org)? Give it
a try; you may find that the files are structured sufficiently to parse
the drug names in an orderly fashion & then use that spreadsheet to
directly create a Base (OOo) database.
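
If Calc does not import the files cleanly, a similar result can be sketched in Python: pull the rows out of an HTML table and write them as CSV, which a spreadsheet or database tool can then import. This assumes the data sits in plain, non-nested `<table>` markup, which is a guess; the names `TableRows` and `table_to_csv` are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> as a list (no nested tables)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table rows found in the HTML as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    page = ("<table><tr><th>Name</th><th>Dose</th></tr>"
            "<tr><td>Aspirin</td><td>100 mg</td></tr></table>")
    print(table_to_csv(page), end="")
```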



 
08-06-2008, 11:37 PM
Derek Broughton

Off-Topic: Parse an html file and transfer the text found

NoOp wrote:
> Have you tried opening the html files in Calc (OpenOffice.org)? Give it
> a try; you may find that the files are structured sufficiently to parse
> the drug names in an orderly fashion & then use that spreadsheet to
> directly create a Base (OOo) database.

Good point. The odds are probably greatly against it - but the advantage if
it works is well worth the small effort to give it a try.
--
derek


 

All times are GMT.
