How to Handle PDF Files

For many translators and project managers (PMs), PDF files are among the most off-putting files to handle. They can be a real nightmare.

Is there a good way to edit a PDF document, or to calculate its word count quickly for a quotation?

The answer is that there is no shortcut. It is possible, but only to a limited extent, and your client may well be unhappy with the results.

We usually meet two kinds of PDF: one is exported from an original DTP file created in a program such as QuarkXPress, FrameMaker or InDesign; the other is a scanned PDF.

A PDF exported from an original DTP file is relatively easy to handle.

There are two ways to convert the text into MS Word.

1. You can re-save the PDF directly as a Word file: go to “File”, click “Save As”, and choose *.doc or *.rtf as the file type.

2. Alternatively, extract the text from the PDF and process it in MS Word. The most straightforward and economical procedure is to press the “Select Text” button in Acrobat Reader, press Ctrl-A (Select All), and copy the contents to the clipboard (Ctrl-C). You can then paste the text into MS Word (Ctrl-V). This is a good way to calculate the word count for your client.

Depending on the complexity of the page layout, the result of the above method may be only minimally satisfactory. Most of the time, though, we just need to copy and paste a PDF into Word; the result is usable, and the method is efficient as long as there are not many pages involved.
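Once the text has been pasted out of the PDF and saved as plain text, the word count for a quotation can also be estimated with a short script. The following is only a minimal Python sketch: the function name count_words and the sample sentence are illustrative, and it assumes that a "word" is any whitespace-separated token containing at least one letter or digit. MS Word and CAT tools use their own counting rules, so their figures may differ slightly.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain a letter or digit.

    Assumption: stand-alone punctuation such as "-" is not billed as a word.
    """
    tokens = text.split()
    return sum(1 for token in tokens if re.search(r"\w", token))


if __name__ == "__main__":
    # Illustrative sample; in practice, read the text pasted from the PDF.
    sample = "Select all the text, copy it - then paste it into Word."
    print(count_words(sample))
```

For a real job, the sample string would be replaced by the contents of the text file saved from Word, and the result should be treated as an estimate to cross-check against the tool used for invoicing.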

But what should we do if the client needs a layout identical to the source?

With copy-paste we can keep fonts and type sizes, but tables disappear, although their content is preserved (in a somewhat mangled form), and illustrations are lost. The main problem is that each line ends in a hard carriage return/line feed, which generally has to be replaced by a single space to restore continuous sentences. These returns have to be deleted one line at a time, under human supervision.
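A script can make a first pass at removing these hard returns before the human check. The sketch below is a minimal Python example under one stated assumption: a blank line in the pasted text marks a real paragraph break, and every other line break is just a line end to be replaced by a space. The function name unwrap_lines is illustrative. Where the pasted text has no blank lines between paragraphs, the script cannot tell them apart, so the output still needs to be reviewed.

```python
import re


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: blank lines separate paragraphs; all other line
    breaks are layout artifacts and become a single space.
    """
    # Normalise CR/LF and bare CR line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split on one or more blank lines (paragraph boundaries).
    paragraphs = re.split(r"\n\s*\n", text)
    joined = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        if lines:
            joined.append(" ".join(lines))
    return "\n\n".join(joined)


if __name__ == "__main__":
    pasted = "Each line ends in\r\na hard return.\r\n\r\nNext paragraph\r\nhere."
    print(unwrap_lines(pasted))
```

The script does not rejoin words hyphenated at a line end or rebuild tables; those cases are left for manual correction.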

And if the PDF is a scanned one, we cannot copy and paste at all. What should we do then?

Conversion method

In many instances, an automatic conversion program is preferable. You can use OCR software such as ABBYY or ScanSoft PDF Converter, which can convert the PDF into an editable Word file and keep the format the same as the source.

PDFs can also be “password protected”. If you do not have the password, you cannot extract text from them. Character converters cannot process these files unless they are given the password. OCR converters can handle them perfectly well, as they just “look at the pages” rather than use the internal character coding.

Conversion problems

The conversion result is not always perfect, though. PDFs are very complicated indeed.

If the layout of the PDF is complex, with many graphics, tables, columns and boxes, all of these will cause problems during conversion and make the result very poor: the converted file will be full of strange styles, disparate measurements, and unconventional character and line spacing. You can use such a file to count words, but it is not suitable for the translation process.

In such cases, we have to resort to manual extraction. Also, some extraction programs offer a menu of layout options for the converted file, ranging from full recreation of the original appearance to plain text extraction.

If your client asks for exactly the same layout as the source PDF, tell them that what they are requesting is not just translation, but translation plus a DTP job.