2005/02/23

converting pdf to text

I needed to convert some PDF files to text,
some of which I then converted to .doc.

I tried several tools and believe the best is xpdf:
http://www.foolabs.com/xpdf/download.html

The Windows version does not come with a graphical
interface (I think); you have to run it from the MS-DOS prompt.
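
For converting many files, the command-line call can be wrapped in a small
script. The following is a minimal sketch in Python, assuming the text
extractor shipped with xpdf (pdftotext) is on the PATH. The function names
and the file names in the usage note are made up for illustration; -enc and
-layout are pdftotext options.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, layout=False, encoding="Latin1"):
    """Build the argument list for xpdf's pdftotext (helper name is hypothetical)."""
    cmd = ["pdftotext", "-enc", encoding]
    if layout:
        cmd.append("-layout")  # try to keep the physical layout of the page
    cmd += [pdf_path, txt_path]
    return cmd


def convert(pdf_path, txt_path, **options):
    """Run pdftotext; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, **options), check=True)
```

Calling convert("minuta.pdf", "minuta.txt") would then do the same as typing
the pdftotext command at the prompt.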

You need to check the output. For instance,
page numbers, page headers, page footers and
watermarks end up in the text too. In one case,
the watermark "MINUTA" ("draft" in Portuguese)
was broken into "MI" and "NUTA", scattered
over the page.
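
Part of this cleanup can be automated. The sketch below is one possible
approach, not something from xpdf itself: it relies on pdftotext ending each
page with a form feed character, drops lines that contain only digits (bare
page numbers), and drops lines that repeat on at least half of the pages
(likely headers, footers or watermark fragments). The function name and the
min_share threshold are assumptions, and a body line that really does repeat
on many pages would be lost too, so the result still needs checking.

```python
import re
from collections import Counter


def strip_page_furniture(text, min_share=0.5):
    """Remove bare page numbers and lines repeated across pages (heuristic)."""
    pages = [p for p in text.split("\f") if p.strip()]

    # Count on how many pages each distinct non-blank line occurs.
    seen = Counter()
    for page in pages:
        seen.update({line.strip() for line in page.splitlines() if line.strip()})

    # A line must occur on at least two pages to count as repeated.
    threshold = max(2, int(len(pages) * min_share))

    cleaned = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            s = line.strip()
            if re.fullmatch(r"\d+", s):
                continue  # bare page number
            if s and seen[s] >= threshold:
                continue  # repeated header, footer or watermark piece
            kept.append(line)
        cleaned.append("\n".join(kept))
    return "\n".join(cleaned)
```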

Another problem is that it doesn't handle accented letters well.
I tried to write an AWK script to patch this,
but it didn't work in all cases (the letters with a
circumflex (^) didn't come out well).
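
A minimal sketch of that kind of patch, in Python rather than AWK. It is not
the script mentioned above, and it assumes one particular kind of damage: the
accent shows up as a separate character right before the letter it belongs
to (for example "voc^e" instead of "você"). If the accents are garbled in
some other way, the mapping has to be adapted. The names COMBINING and
patch_accents are made up for illustration.

```python
import re
import unicodedata

# Standalone accent character -> Unicode combining mark (assumed damage pattern).
COMBINING = {
    "^": "\u0302",  # circumflex
    "´": "\u0301",  # acute
    "`": "\u0300",  # grave
    "~": "\u0303",  # tilde
    "¨": "\u0308",  # diaeresis
}
PATTERN = re.compile(r"([\^´`~¨])([A-Za-z])")


def patch_accents(text):
    """Merge 'accent + letter' pairs into a single accented letter."""
    def fix(match):
        mark, letter = match.groups()
        composed = unicodedata.normalize("NFC", letter + COMBINING[mark])
        # Keep the original pair when no single accented letter exists.
        return composed if len(composed) == 1 else match.group(0)

    return PATTERN.sub(fix, text)
```

Pairs that do not form a real accented letter, and accent characters followed
by a digit or a space, are left untouched.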

The best thing to do is to open the text afterwards in
a program that does spell checking; in a few
minutes you can correct everything.