08-25-2011, 12:35 AM
Scott Ferguson
A Bit of a Strange Situation

On 25/08/11 09:33, RiverWind wrote:
>
> Hey There,
>

<snipped>

> I have downloaded the linux cookbook, which consists of over five-
> hundred html files. I am wanting to concatenate them all into one
> big neat file, with all of the smaller files in perfect order. Now
> I know that "cat" can do this, but the file naming protocol is a
> bit strange. The names of the smaller files have to accommodate
> both "parts" and "sections", which makes for an interesting naming
> format. For instance, the first is named "cookbook1.html#SEC1." I
> tried the following command.
>
> cat *.html#SEC0*
>
> Now, were the files named something like "cookbook01-100.html",
> there wouldn't be a problem. However, how does one go about
> accommodating files that have two extensions? There is the standard
> ".html" extension, followed by the not so conventional "#SEC", and
> I am not sure how to work it into the cat command.

You have a number of problems. :-(

If you succeed you'll have an enormous .html file that your browser will
baulk at loading (that's why it was broken up in the first place).
The pages may not be in order.
The .html may not work.
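
The ordering concern can be checked with a short sketch. It assumes the files really are named as described in the question (`cookbookN.html#SECM`) and that GNU `sort -V` is available; the sample names and contents are invented for illustration.

```shell
#!/bin/sh
# Sketch only: sample file names follow the pattern from the question.
LC_ALL=C; export LC_ALL          # make glob and sort order predictable
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a few sample files.
for name in 'cookbook1.html#SEC10' 'cookbook2.html#SEC11' \
            'cookbook1.html#SEC2'  'cookbook1.html#SEC1'; do
    printf '%s\n' "content of $name" > "$name"
done

# A '#' only starts a comment at the beginning of a word, so the glob
# works unquoted. The shell expands it in lexical order, which puts
# SEC10 ahead of SEC2.
lexical=$(printf '%s\n' *.html#SEC*)

# GNU "version sort" compares the embedded numbers numerically.
natural=$(printf '%s\n' *.html#SEC* | sort -V)

printf 'glob order:\n%s\n\nsort -V order:\n%s\n' "$lexical" "$natural"

# Concatenate in natural order. Reading names line by line is safe here
# because none of them contains a newline.
printf '%s\n' "$natural" | while IFS= read -r f; do
    cat -- "$f"
done > combined.html
```

The `cat *.html#SEC0*` attempt from the question matches nothing if no file name contains `SEC0`, which may be why it failed.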

>
> Any suggestions will be highly welcomed, because I am wanting to
> begin learning linux in earnest. I am trying to use books and
> manuals before yelling for help, and I am very much looking forward
> to the time when I can start giving help instead of always
> hollering for it.

Rather than suggest complicated methods of making a single-page html
file for you - why not download the Linux Cookbook as a single pdf file?

http://www.usinglinux.org/docu/guides/linuxcookbook-1.2.pdf

>
> cheerio,
> Riv
>
> Feel free to visit my website and my blog and learn more about me
> and what I stand for.
> My Website @ http://riverwind.shellworld.net
> My Blog http://windraven13.livejournal.com/
>
>

For other useful docs see:-
http://www.debian.org/doc/
http://www.tldp.org/

and in the debian repository:-
installation-guide-[arch]
doc-linux-nonfree-html


Cheers

--
"If the FBI's motivating factor for busting down the Koresh compound was
child abuse, how come we never see Bradley tanks smashing into Catholic
churches?"
~ Bill Hicks


--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/4E5598D7.6080704@gmail.com
 
08-25-2011, 09:46 AM
Curt
A Bit of a Strange Situation

On 2011-08-24, RiverWind <riverwind@shellworld.net> wrote:
>
> I have downloaded the linux cookbook, which consists of over five-
> hundred html files. I am wanting to concatenate them all into one
> big neat file, with all of the smaller files in perfect order. Now
> I know that "cat" can do this, but the file naming protocol is a
> bit strange. The names of the smaller files have to accommodate
> both "parts" and "sections", which makes for an interesting naming
> format. For instance, the first is named "cookbook1.html#SEC1." I
> tried the following command.

How 'bout just downloading one nice big neat pdf file?

http://www.usinglinux.org/docu/guides/linuxcookbook-1.2.pdf

You could convert that to html with 'pdftohtml'. Whether the resulting
document would meet your rigorous standards, I dunno.

If not, if you find the names of the html files in your possession
inconvenient, why not rename them?
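
A minimal sketch of the renaming idea, assuming names of the form `cookbookN.html#SECM` with N and M written without leading zeros. The target pattern `cookbook-NN-MMM.html` is an arbitrary choice, and the sample files are invented.

```shell
#!/bin/sh
# Sketch only: rename cookbookN.html#SECM to zero-padded names so that a
# plain glob already lists the files in reading order.
LC_ALL=C; export LC_ALL
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample files; each one just contains its own original name.
for name in 'cookbook1.html#SEC1' 'cookbook1.html#SEC10' \
            'cookbook1.html#SEC2' 'cookbook12.html#SEC300'; do
    printf '%s\n' "$name" > "$name"
done

for old in cookbook*.html#SEC*; do
    part=${old#cookbook}       # drop the "cookbook" prefix
    part=${part%%.html*}       # keep only the part number
    sec=${old##*\#SEC}         # keep only the section number
    new=$(printf 'cookbook-%02d-%03d.html' "$part" "$sec")
    mv -- "$old" "$new"
done

# With zero-padded names a plain glob is in the right order.
cat cookbook-*.html > combined.txt
```

Numbers with leading zeros (such as `08`) would be read as octal by `printf '%d'`, so this sketch relies on the unpadded numbering shown in the question.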


--
Archive: http://lists.debian.org/slrnj5c6fs.2lq.curty@einstein.electron.org
 
08-25-2011, 10:01 AM
Jude DaShiell
A Bit of a Strange Situation

pdf has accessibility issues for screen reader users, and riverwind and
I are both screen reader users. The best we can attempt is a text
extraction from pdf files if we're going to read what's in them. If
what is in the file is a scanned image, maybe that can be scanned
on Windows; I don't know whether a parallel capability exists on Linux yet.
Also, whenever text extraction gets done on pdf files with command line
tools on Linux, there are spelling mistakes in the output. Those of us
who can't see the screen would be really happy if either Adobe had never
come into existence or it had never invented that format. Also,
knowledgeable sighted technical people I talk with hate Adobe and pdf
with a passion, and they can't all be wrong.

On Thu, 25 Aug 2011, Curt wrote:

> On 2011-08-24, RiverWind <riverwind@shellworld.net> wrote:
> >
> > I have downloaded the linux cookbook, which consists of over five-
> > hundred html files. I am wanting to concatenate them all into one
> > big neat file, with all of the smaller files in perfect order. Now
> > I know that "cat" can do this, but the file naming protocol is a
> > bit strange. The names of the smaller files have to accommodate
> > both "parts" and "sections", which makes for an interesting naming
> > format. For instance, the first is named "cookbook1.html#SEC1." I
> > tried the following command.
>
> How 'bout just downloading one nice big neat pdf file?
>
> http://www.usinglinux.org/docu/guides/linuxcookbook-1.2.pdf
>
> You could convert that to html with 'pdftohtml'. Whether the resulting
> document would meet your rigorous standards, I dunno.
>
> If not, if you find the names of the html files in your possession
> inconvenient, why not rename them?
>
>
>

Jude <jdashiel@shellworld.net>
"I love the Pope, I love seeing him in his Pope-Mobile, his three feet
of bullet proof plexi-glass. That's faith in action folks! You know he's
got God on his side."
~ Bill Hicks


--
Archive: http://lists.debian.org/alpine.BSF.2.00.1108250554310.85279@freire1.furyyjbeyq.arg
 
08-25-2011, 11:49 AM
Scott Ferguson
A Bit of a Strange Situation

On 25/08/11 20:01, Jude DaShiell wrote:
> pdf has accessibility issues for screen reader users

Some pdfs have issues.
Some of the pdf issues are accessibility. :-)
Some html files also have accessibility issues...

> and riverwind and me are both screen reader users.

And you are not alone.

> The best we can attempt is a text extraction from pdf files if we're
> going to read what's in them.

Then you have been sadly misinformed.
I have no problems reading the pdf I linked with Okular (using kttsd) -
I prefer the html version, but I wouldn't want it as a single file.

I'd recommend careful preparation (food, drink, sleeping bag etc) before
attempting to screen read a single-page document made from 544 pages -
or be ready to spend the next few hours trying to find
speech-dispatcher's PID (without the benefit of a reader) so you can
kill it! ;-D

> If what was left in the file was a scanned image, maybe that can be
> scanned on Windows I don't know that parallel capability exists with
> Linux yet.

Usually the other way around. Eg. one day Windoof will have
screenreading built-in to the core and people will stop forking out big
dollars thinking JAWS is "assistive technology".

Tesseract does an excellent job of OCRing pdfs that are just image -
there are GUI options.

> Also, whenever text extraction gets done on pdf files with command
> line tools with Linux there are spelling mistakes in the output.

I'm assuming you use Orca (or whatever Gnome calls its reader) - surely
that works with the Gnome PDF viewer?

> The pdf format is just something those of us that can't see the
> screen would be really happy if either Adobe had never come into
> existence or invented that format.

If Microsoft ceased to exist I'd agree - but they do, and the best I can
do with some "users" is get them to send me a pdf *instead* of a
"rent-a-view" Office document or some other proprietary method of making
information asymmetrically accessible.... It's a less than perfect world
so I accept less than perfect solutions.

> Also, knowledgeable sighted
> technical people I talk with hate Adobe and pdf with a passion and
> they can't all be wrong.

Originally Adobe *was* pdf. This is no longer the case - it was made an
open standard three years ago (ISO 32000-1:2008).

Plain text is good, RTF is nice, HTML is better.
Sadly, many people have problems with cross-platform text files, and
HTML is often made ugly and unusable. PDFs can be ugly too - but most
people have no problems viewing or printing them. So pdfs are often
the "least worst" format for styled text and image documents. It's
also a handy format for saving reference webpages.

>
> On Thu, 25 Aug 2011, Curt wrote:
>
<snipped>
>
>

Cheers

--
"If the FBI's motivating factor for busting down the Koresh compound was
child abuse, how come we never see Bradley tanks smashing into Catholic
churches?"
~ Bill Hicks


--
Archive: http://lists.debian.org/4E5636AD.3070001@gmail.com
 
08-25-2011, 01:38 PM
Curt
A Bit of a Strange Situation

On 2011-08-25, Jude DaShiell <jdashiel@shellworld.net> wrote:

> pdf has accessibility issues for screen reader users and riverwind and

I missed the part about screen reading, if it was included in the OP.

I found the following for Gnome, with Orca:

http://live.gnome.org/Orca/Acroread#details

I admit I don't understand the requirement of accessing the document as
a single html file.


--
Archive: http://lists.debian.org/slrnj5ck34.2ng.curty@einstein.electron.org
 
08-25-2011, 04:54 PM
Lisi
A Bit of a Strange Situation

On Thursday 25 August 2011 11:01:54 Jude DaShiell wrote:
> pdf has accessibility issues for screen reader users and riverwind and
> me are both screen reader users.

There are quite a few visually challenged people on this list. If someone has
sight problems, as you and River have, it is worth including that information
in the original posting. There is a good chance that someone else may
already have solved, by one means or another, the same problem, having had to
solve the problem for himself/herself.

Lisi


--
Archive: http://lists.debian.org/201108251754.00576.lisi.reisz@gmail.com
 
08-25-2011, 05:44 PM
RiverWind
A Bit of a Strange Situation

Hi,

The idea was to concat a large html file and then convert it to text. The
pdf can be converted to text, and it so far seems like a pretty viable
translation.


Riv

Feel free to visit my website and my blog and learn more about me
and what I stand for.
My Website @ http://riverwind.shellworld.net
My Blog http://windraven13.livejournal.com/

On Thu, 25 Aug 2011, Curt wrote:

> On 2011-08-25, Jude DaShiell <jdashiel@shellworld.net> wrote:
>
> > pdf has accessibility issues for screen reader users and riverwind and
>
> I missed the part about screen reading, if it was included in the OP.
>
> I found the following for Gnome, with Orca:
>
> http://live.gnome.org/Orca/Acroread#details
>
> I admit I don't understand the requirement of accessing the document as
> a single html file.


--
Archive: http://lists.debian.org/Pine.BSF.4.64.1108251343200.2308@server1.shellworld.net
 
08-25-2011, 07:45 PM
Bob Proulx
A Bit of a Strange Situation

RiverWind wrote:
> The idea was to concat a large html file and then convert it to
> text. The pdf can be converted to text, and it so far seems like a
> pretty viable translation.

If I were going to do that for myself I would convert each individual
html file to text first and then concatenate the individual text
files. The reason being that the individual html files are at that
moment completely consistent. Individually they should be able to
convert to text cleanly with no problems. And then the text can be
concatenated. But once you concatenate the html then you have created
a Frankenstein html file that is almost certainly going to be
problematic to convert to text.
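
The convert-first, concatenate-second approach can be sketched as follows. The converter is an assumption: `w3m -dump` is used when it is installed, and a crude `sed` tag stripper stands in otherwise so the sketch still runs. The file names and contents are invented, and `sort -V` assumes GNU coreutils.

```shell
#!/bin/sh
# Sketch only: convert each HTML page to text on its own, then concatenate.
LC_ALL=C; export LC_ALL
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '<html><body><h1>Section one</h1><p>First page.</p></body></html>\n' \
    > 'cookbook1.html#SEC1'
printf '<html><body><h1>Section two</h1><p>Second page.</p></body></html>\n' \
    > 'cookbook1.html#SEC2'

# Convert HTML on stdin to text on stdout.
html_to_text() {
    if command -v w3m >/dev/null 2>&1; then
        w3m -dump -T text/html
    else
        sed -e 's/<[^>]*>/ /g'   # fallback: strip tags, nothing more
    fi
}

# Step 1: convert every page while it is still a complete HTML document.
mkdir text
for f in *.html#SEC*; do
    html_to_text < "$f" > "text/$f.txt"
done

# Step 2: concatenate the text files in natural (numeric) order.
printf '%s\n' text/*.txt | sort -V | while IFS= read -r t; do
    cat -- "$t"
done > book.txt
```

Feeding each page through stdin avoids any chance of the converter treating `#SEC1` in the file name as a URL fragment.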

Also, my naive experience with this is that converting html to text is
a lot easier than converting pdf to text. Html is already a
text type. The mime type is "text/html" after all. But pdf has been
less accessible for conversions for me. The mime type is
"application/pdf" and isn't a text type. That leaves more room
for errors to creep in.

Bob
 
08-25-2011, 08:57 PM
shawn wilson
A Bit of a Strange Situation

On Thu, Aug 25, 2011 at 15:45, Bob Proulx <bob@proulx.com> wrote:
> RiverWind wrote:
>> The idea was to concat a large html file and then convert it to
>> text. The pdf can be converted to text, and it so far seems like a
>> pretty viable translation.
>
> If I were going to do that for myself I would convert each individual
> html file to text first and then concatenate the individual text
> files. The reason being that the individual html files are at that
> moment completely consistent. Individually they should be able to
> convert to text cleanly with no problems. And then the text can be
> concatenated. But once you concatenate the html then you have created
> a Frankenstein html file that is almost certainly going to be
> problematic to convert to text.
>
> Also, my naive experience with this is that converting html to text is
> a lot easier than converting pdf to text. With html it is already a
> text type. The mime type is "text/html" after all. But pdf has been
> less accessible for conversions for me. The mime time is
> "application/pdf" and isn't a text type. That introduces more room
> for error to be introduced.
>

yes, converting html to text is easier than converting pdf to text -
pdf is nice in the native format but when you get into extracting
stuff, it's a pain. pdf is not text. you can break the elements into a
dom like structure. however, html's dom and pdf's "dom" aren't the
same - pdf has an absolute x/y where the element is to be displayed
and the element can be binary data (ie a picture).

that said, i don't think there will be any accessibility issues with
that pdf and it might even convert cleanly (one has a lot to do with
the other). so, i would just go with the pdf and be done with it.
however, if you are hell bent on converting it to something, i would
use something that will keep some formatting - latex or pod come to
mind. maybe consider this:
http://cpan.uwinnipeg.ca/htdocs/Pod-HTML2Pod/Pod/HTML2Pod.html

the latex looks pretty simple too (though i have minimal experience with tex):
http://www.iwriteiam.nl/html2tex.html

per parsing those html files to figure out chapter, i'd personally use
perl and search for the chapter and section in the file, build up a
hash of that info and the file that contains it, sort and go from
there.
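
The index-and-sort idea can be sketched in shell instead of perl, with a tab-separated index file playing the role of the hash. The heading format `<h2>CHAPTER.SECTION Title</h2>` and the file names are hypothetical; the `sed` pattern would need adjusting to whatever the real pages contain.

```shell
#!/bin/sh
# Sketch only: order pages by a chapter.section number found inside them.
LC_ALL=C; export LC_ALL
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample pages whose file names say nothing about their order.
printf '<h2>2.1 Pipes</h2>\n'         > page-a.html
printf '<h2>1.10 Globbing</h2>\n'     > page-b.html
printf '<h2>1.2 Listing files</h2>\n' > page-c.html

# Build an index of "chapter<TAB>section<TAB>file", one line per page.
: > index.tsv
for f in *.html; do
    key=$(sed -n 's/.*<h2>\([0-9][0-9]*\)\.\([0-9][0-9]*\) .*/\1 \2/p' "$f" |
          head -n 1)
    [ -n "$key" ] || continue    # skip pages without a numbered heading
    set -- $key                  # $1 = chapter, $2 = section
    printf '%s\t%s\t%s\n' "$1" "$2" "$f" >> index.tsv
done

# Sort numerically by chapter, then section, and concatenate in that order.
sort -t "$(printf '\t')" -k1,1n -k2,2n index.tsv | cut -f3- |
while IFS= read -r f; do
    cat -- "$f"
done > ordered.html
```

Sorting each key numerically is what puts section 1.2 ahead of 1.10, which a plain string sort would get wrong.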

it does not seem that there is an easy way to go from pdf -> latex (as
i suspected).


--
Archive: http://lists.debian.org/CAH_OBics+bU-go+i4oizOJbWD0owe__GZaq+L_7=LFuxF+PCLw@mail.gmail.com
 
08-25-2011, 09:38 PM
Jude DaShiell
A Bit of a Strange Situation

I use orca very little and find the command line a better choice. I
have to use a g.u.i. at work, and it's nice to come home and not have
to do that too.

Linux and Apple so far are the only two alternatives that ever got
accessibility mostly right. Windows 3.11 could be installed using a
screen reader under pcdos or msdos but Microsoft broke that capability
as fast as it could; I don't regard any operating system I can't
reinstall myself if necessary as being ready for my home use because of
that huge accessibility deficiency. I installed Tiger on my Mac Mini by
myself and went all the way up to Snow Leopard before the Mini died
permanently last week. I've installed Linux more than once by myself.
I've never been able to install windows in a bare metal scenario with
the screen reader working. Once I managed to install windows using a
sheet of brailled instructions and listening for when the cd drive
stopped spinning to do each instruction, but nobody regards that as
accessible these days.


Jude <jdashiel@shellworld.net>
"I love the Pope, I love seeing him in his Pope-Mobile, his three feet
of bullet proof plexi-glass. That's faith in action folks! You know he's
got God on his side."
~ Bill Hicks


--
Archive: http://lists.debian.org/alpine.BSF.2.00.1108251729580.8523@freire1.furyyjbeyq.arg
 
