I have a 96-page PDF file that I need to convert to text in one run.
I've imported it into Inkscape, but that only converts one page at a
time. I've tried pdftotext, but I can't work out the syntax, so I
haven't been able to test it properly. I've tried pdfedit, but that
only works on one page at a time and doesn't convert it to text.
Can anyone suggest a way to convert the whole PDF to text in one go,
please?
Many thanks
Sharon.
--
A taste of linux http://www.sharons.org.uk/taste/index.html
efever http://www.efever.blogspot.com/
Debian 6.0.2, KDE 4.4.5, LibreOffice 3.4.3
Registered Linux user 334501
--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/CAM9u--epSPT+mEuuLetN7SYpwQ6W3ruYYZDiHMvhSE_NGzeHyg@mail.gmail.com
09-22-2011, 05:57 AM
The_Ace
Convert a pdf to text
On Thu, Sep 22, 2011 at 11:01 AM, Sharon Kimble <skimble04@gmail.com> wrote:
> I have a 96 page pdf file that I need to convert to text in one run.
> I've imported it into inkscape but that only converts one page at a
> time. I've tried using pdftotext but i cant work out the syntax for
> that so am unable to test it out properly. I've tried pdfedit but that
> only works on one page at a time and doesnt convert it to text.
>
> Can anyone help me out with suggestions for converting the pdf in one
> go to text please?
>
> Many thanks
> Sharon.
> --
Use pdftotext if you want it converted to plain text, like this:
pdftotext -layout /path/to/pdffile.pdf /path/to/textfile.txt
or, if you want it as HTML (text only):
pdftotext -htmlmeta /path/to/pdffile.pdf /path/to/textonlyHTMLfile.html
If you want to keep images, colors and other formatting as well, then
HTML is the only option. Use pdftohtml for that.
To convert to a single html file for the content:
pdftohtml -p -nodrm /path/to/pdffile.pdf /path/to/htmlfile.html
This actually creates three html files:
htmlfile.html - the main file to view
htmlfiles.html - the full converted single html file
htmlfile_ind.html - the navigation page
To convert to multiple html files (one html file for each page):
pdftohtml -c -p -nodrm /path/to/pdffile.pdf /path/to/htmlfile.html
This creates two main files, along with one html page for each page in the book:
htmlfile.html - the main file to view
htmlfile_ind.html - the navigation page
Keep in mind that pdftohtml is memory intensive, and creating a single
html file is extremely memory intensive.
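As a small sketch building on the pdftotext usage above: the -f and -l options (first and last page) limit the conversion to a page range, which can be handy for trying the command on a few pages before converting a whole document. The function name pdf_pages_to_text and the file names in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert pages FIRST..LAST of a PDF to plain text.
# -f, -l and -layout are standard pdftotext options.
pdf_pages_to_text() {
    pdf=$1
    first=$2
    last=$3
    out=$4
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found (Debian package: poppler-utils)" >&2
        return 1
    fi
    if [ ! -f "$pdf" ]; then
        echo "no such file: $pdf" >&2
        return 1
    fi
    pdftotext -f "$first" -l "$last" -layout "$pdf" "$out"
}

# Example (assumed file names): only the first ten pages.
# pdf_pages_to_text report.pdf 1 10 report-p1-10.txt
```

Leaving out -f and -l converts every page, which is the default behaviour.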
--
The mysteries of the Universe are revealed when you break stuff.
--
Archive: http://lists.debian.org/CAM8yCh_MkjWBiRRmrswQ+kMd6n5aBtda1O-LPF8XqqiK=fk=Zg@mail.gmail.com
09-22-2011, 06:00 AM
Scott Ferguson
Convert a pdf to text
On 22/09/11 15:31, Sharon Kimble wrote:
> I have a 96 page pdf file that I need to convert to text in one run.
> I've imported it into inkscape but that only converts one page at a
> time. I've tried using pdftotext but i cant work out the syntax for
> that so am unable to test it out properly. I've tried pdfedit but that
> only works on one page at a time and doesnt convert it to text.
>
> Can anyone help me out with suggestions for converting the pdf in one
> go to text please?
>
> Many thanks
> Sharon.
Do you mean a single multi-page pdf, or many separate pages?
Converting all of a multi-page pdf is just:-
$ pdftotext multipage_example.pdf
which will produce a single text file called multipage_example.txt
containing all the text from the pdf.
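One way to check that every page made it into the text file, as a rough sketch: by default pdftotext writes a form feed character at each page break (the -nopgbrk option turns this off), so counting form feeds should give roughly the page count. The function name count_page_breaks is an assumption for illustration.

```shell
#!/bin/sh
# Count the form feed characters (page breaks) in a text file.
count_page_breaks() {
    # tr keeps only form feeds; wc -c counts them.
    # The arithmetic expansion strips any padding wc adds.
    echo $(( $(tr -cd '\f' < "$1" | wc -c) ))
}

# Usage, assuming multipage_example.txt was produced by pdftotext:
# count_page_breaks multipage_example.txt
# Compare the result with the "Pages:" line printed by: pdfinfo multipage_example.pdf
```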
If you want to preserve the formatting, try pdftohtml.
If some (or all) of the content is images of text, try tesseract, though
you'll have to do a little preparation.
Okular will also export a pdf to text (provided all the text in the pdf
is actual text, not images).
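The "little preparation" for tesseract might look something like the sketch below: render each page to an image with pdftoppm (from poppler-utils), run tesseract on each image, and join the results. This assumes reasonably recent versions of both tools (pdftoppm with -png support, and a tesseract that reads PNG files); older tesseract releases may need TIFF input instead. The function name ocr_pdf is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: OCR a scanned PDF into a single text file.
ocr_pdf() {
    pdf=$1
    out=$2
    for tool in pdftoppm tesseract; do
        command -v "$tool" >/dev/null 2>&1 || { echo "$tool not found" >&2; return 1; }
    done
    [ -f "$pdf" ] || { echo "no such file: $pdf" >&2; return 1; }

    tmp=$(mktemp -d) || return 1
    # Render every page as a 300 dpi PNG named page-<number>.png.
    pdftoppm -r 300 -png "$pdf" "$tmp/page" || { rm -rf "$tmp"; return 1; }

    : > "$out"
    # pdftoppm zero-pads the page numbers, so the glob should keep page order.
    for img in "$tmp"/page-*.png; do
        [ -f "$img" ] || continue
        # tesseract writes its result to <output base>.txt
        tesseract "$img" "${img%.png}" >/dev/null 2>&1 &&
            cat "${img%.png}.txt" >> "$out"
    done
    rm -rf "$tmp"
}

# Example (assumed file names):
# ocr_pdf scanned.pdf scanned.txt
```

OCR quality depends heavily on the scan, so the output usually needs proofreading.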
Cheers
--
"People say to me, "Bill, quit bringing up Kennedy, man. Let it go. It
was a long time ago. Just forget about it."
All right, then don't bring up Jesus to me. I mean, as long as we're
talking shelf-life here.
"You know, Bill, Jesus died for you …" Yeah, it was a long time ago.
Forget about it.
How about this: get Pilate to release the [beep]in' files. Quit washing
your hands, Pilate, and release the files. Who else was on that grassy
Golgotha that day? Oh yeah, the three Roman peasants in $100 sandals"
— Bill Hicks
--
Archive: http://lists.debian.org/4E7ACF0D.9020507@gmail.com
09-22-2011, 06:03 AM
Doug
Convert a pdf to text
On 09/22/2011 01:31 AM, Sharon Kimble wrote:
> I have a 96 page pdf file that I need to convert to text in one run.
> I've imported it into inkscape but that only converts one page at a
> time. I've tried using pdftotext but i cant work out the syntax for
> that so am unable to test it out properly. I've tried pdfedit but that
> only works on one page at a time and doesnt convert it to text.
>
> Can anyone help me out with suggestions for converting the pdf in one
> go to text please?
>
> Many thanks
> Sharon.
According to LibreOffice.org, there is an extension to LO that will
import pdfs. I have not used it, so I don't know if it will read the
whole file in at once or not. It may or may not be built in to LO 3.3,
Google does not say, but if not you should be able to download it for
that or an earlier version. Apparently there is also a downloadable
extension for OpenOffice 3.0. (These are very likely to be the same
code.)

Finally, if there is nothing that will read more than one page at a
time, it should be possible to write a script that would take the
labor out of it. If you can't figure out how to do that, somebody on
the list here can probably help. (Not me--I'm a green novice at bash
scripting!) Then share the script with the readers here--I'm sure it
would find happy users!
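A minimal sketch of such a script, under the assumption that pdftotext is the converter and that the goal is to convert every PDF in a directory. A single PDF does not need a loop, since pdftotext handles all of its pages in one call. The function name convert_all and the PDFTOTEXT variable are made up for illustration.

```shell
#!/bin/sh
# Converter command; set PDFTOTEXT to substitute another tool.
: "${PDFTOTEXT:=pdftotext}"

# Convert every *.pdf in a directory and print how many succeeded.
convert_all() {
    dir=$1
    count=0
    for pdf in "$dir"/*.pdf; do
        # When nothing matches, the pattern is left as-is; skip it.
        [ -f "$pdf" ] || continue
        # Quoting "$pdf" keeps file names with spaces in one piece.
        if "$PDFTOTEXT" "$pdf"; then
            count=$((count + 1))
        else
            echo "failed: $pdf" >&2
        fi
    done
    echo "$count"
}

# Example: convert everything in the current directory.
# convert_all .
```

With pdftotext as the converter, each text file is written next to its PDF with a .txt extension.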
--doug
--
Blessed are the peacemakers...for they shall be shot at from both sides. --A. M. Greeley
On 22 September 2011 07:03, Doug <dmcgarrett@optonline.net> wrote:
> On 09/22/2011 01:31 AM, Sharon Kimble wrote:
>>
>> I have a 96 page pdf file that I need to convert to text in one run.
>> I've imported it into inkscape but that only converts one page at a
>> time. I've tried using pdftotext but i cant work out the syntax for
>> that so am unable to test it out properly. I've tried pdfedit but that
>> only works on one page at a time and doesnt convert it to text.
>>
>> Can anyone help me out with suggestions for converting the pdf in one
>> go to text please?
>>
>> Many thanks
>> Sharon.
>
> According to LibreOffice.org, there is an extension to LO that will import
> pdfs. I have not used it, so I don't know if it will read the whole file in
> at once or not. It may or may not be built in to LO 3.3, Google does not
> say, but if not you should be able to download it for that or an earlier
> version. Apparently there is also a downloadable extension for OpenOffice
> 3.0. (These are very likely to be the same code.)
>
> Finally, if there is nothing that will read more than one page at a time,
> it should be possible to write a script that would take the labor out of
> it. If you can't figure out how to do that, somebody on the list here can
> probably help. (Not me--I'm a green novice at bash scripting!) Then share
> the script with the readers here--I'm sure it would find happy users!
>
> --doug
>
> --
> Blessed are the peacemakers...for they shall be shot at from both sides.
> --A. M. Greeley
>
>
> --
Many thanks to all, I was able to do it using pdftotext. The file was
named like 'UK Households 2005' and it would not convert, so I renamed
it as 'UKHouseholds2005' and it was converted straight away with no
messing.
Thanks again
Sharon.
--
Archive: http://lists.debian.org/CAM9u--fypG55BLCykv7VPvgb3ca8SRzSSFjE_Dn2WFRs5MnMWA@mail.gmail.com
09-22-2011, 06:33 AM
Scott Ferguson
Convert a pdf to text
On 22/09/11 16:27, Sharon Kimble wrote:
> On 22 September 2011 07:03, Doug <dmcgarrett@optonline.net> wrote:
>> On 09/22/2011 01:31 AM, Sharon Kimble wrote:
>>>
<snipped>
>> --
> Many thanks to all, I was able to do it using pdftotext. The file was
> named like 'UK Households 2005' and it would not convert, so I renamed
> it as 'UKHouseholds2005' and it was converted straight away with no
> messing.
For future reference:-
$ pdftotext "UK Households 2005".pdf
OR
$ pdftotext "UK Households 2005.pdf"
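The reason quoting matters can be seen with a tiny experiment: without quotes the shell splits the name at each space, so the command receives three separate arguments instead of one file name. The helper count_args below is made up for this illustration; it only reports how many arguments it was given.

```shell
#!/bin/sh
# Print how many arguments the shell passed to the function.
count_args() {
    echo "$#"
}

count_args UK Households 2005.pdf       # unquoted: prints 3
count_args "UK Households 2005.pdf"     # quoted: prints 1
count_args UK\ Households\ 2005.pdf     # backslash-escaped spaces: prints 1
```

Escaping each space with a backslash is a third way to pass the name as a single argument.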
>
> Thanks again
> Sharon.
Cheers
--
Archive: http://lists.debian.org/4E7AD6C4.70502@gmail.com