
Translating PDF Files

Published: December 20, 2008; Views: 17,826; Source: Proz.com; Author: Hector Calabia; Entered by: giltworld

Handling PDF Documents

For many translators, Acrobat Portable Document Format (PDF) files are a nightmare. What is more, the same question keeps resurfacing in forums and mailing lists: How can I edit this PDF document?

Short answer: you can’t.

Long answer: it is possible, but only to a limited extent, and quite possibly your client will not be happy with the results.

The point is that PDF documents were never meant for editing or translation. About ten years ago, Adobe Inc., its creator, successfully responded to a market need: documents that could be easily exchanged, printed, and viewed, but not modified, on all kinds of computers. That is, the computer equivalent of a printed document. This is exactly what Acrobat documents are.

They must be considered “printed” documents, not “editable” documents. The format has been so successful that there is a steady flow of PDF documents to translate. However, because most computer formats are editable, there is a problem: despite the original intention of the format’s creators, people cannot be convinced that PDFs are uneditable, and so they ask for translations and modifications to these documents.

Many translators already know the answer: you cannot deliver a translation as a PDF, at least not in the same PDF that you received. The format is not (extensively) editable.

What you can do is try to extract the text from the PDF and process it using your favorite word processor. The most straightforward and economical procedure is simply to press the “Select Text” button in Acrobat Reader, press Ctrl-A (Select All), and copy the contents to the clipboard (Ctrl-C). Then you can paste this into your word processor (Ctrl-V).

Depending on the complexity of the page layout, this may prove minimally satisfactory. Although I haven’t used this process for some time, I have just copied and pasted a PDF into Word and the result is usable, at a pinch: fonts and type sizes are kept; tables disappear, although their content is preserved (in a somewhat mangled form); illustrations are gone. The main problem is that each line ends in a hard carriage return/line feed, which generally has to be replaced by a single space in order to have continuous sentences again. I have developed a small Word macro that searches for carriage returns and replaces them with spaces. This, however, has to be done one line at a time, under human supervision, because the system cannot know when the carriage return should be kept (for instance, in headings, in lists, and at the end of paragraphs).
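The same clean-up can be roughed out in a short script. The sketch below is not the Word macro mentioned above; it is a Python illustration built on assumed heuristics: a break is kept when it borders a blank line, when the next line looks like a list item, or when the line is much shorter than the widest line (a likely heading or paragraph end). The names `keep_break` and `join_hard_breaks` and the 0.6 ratio are made up for this example.

```python
import re

# A line that begins with a bullet or "1." / "1)" is treated as a list item.
LIST_MARKER = re.compile(r"^(?:[-*\u2022]|\d+[.)])\s+")


def keep_break(line, next_line, full_width, short_ratio=0.6):
    """Guess whether the line break after `line` should be kept."""
    if not line or not next_line:
        return True  # a blank line marks a paragraph boundary
    if LIST_MARKER.match(next_line):
        return True  # the next line starts a list item
    # A line much shorter than the widest one is probably a heading
    # or the last line of a paragraph (assumed threshold).
    return len(line) < short_ratio * full_width


def join_hard_breaks(text):
    """Replace layout-only line breaks with spaces, keeping likely real ones."""
    lines = [ln.strip() for ln in text.splitlines()]
    if not lines:
        return ""
    full_width = max(len(ln) for ln in lines)
    pieces = [lines[0]]
    for line, next_line in zip(lines, lines[1:]):
        pieces.append("\n" if keep_break(line, next_line, full_width) else " ")
        pieces.append(next_line)
    return "".join(pieces)
```

Calling `join_hard_breaks(pasted_text)` on text copied from a PDF returns the text with most layout-only breaks replaced by spaces. The heuristics will guess wrong at times, for example when the last line of a paragraph happens to fill the full width, so the result still needs the human review described above.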

Automatic conversion

In many instances, an automatic conversion program is preferable. I have used both a “pure” PDF-to-Word converter (Scansoft PDF Converter) and optical character recognition software (Omnipage and Fine Reader). You can find a healthy supply of both types by doing an Internet search on “PDF extraction” or “PDF conversion”.

What is the difference between them?

I have already said that PDFs are like printed documents. In most cases, however, the text is kept as computer characters, that is, you can copy/paste it. In some other cases, all the text (or some of it, in headings, for instance) is just an image, like the characters on a faxed page. “Pure” PDF converters can handle computer characters, but they choke on graphics. If a document contains all or some “graphics-characters” areas, they cannot process them. In this instance, optical character recognition programs come to the rescue. They look at the page as if it were really a printed page, and they try to interpret it and convert it to computer characters. They may also extract the illustrations on the page. It is not necessary to print and scan the Acrobat pages for this. Modern OCR software accepts them directly.
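The choice between the two kinds of tool can be expressed as a simple rule. The sketch below assumes you have already obtained, by whatever means, the text that can be copied from each page as a list of strings; the function names and the 20-character threshold are assumptions made for illustration, not part of any converter.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'text' or 'image' from its extractable text.

    A page yielding fewer than `min_chars` characters (an assumed
    threshold) is treated as an image of text, like a faxed page.
    """
    return [
        "text" if len(text.strip()) >= min_chars else "image"
        for text in page_texts
    ]


def choose_tool(page_texts, min_chars=20):
    """Suggest 'converter' if every page has real characters, else 'ocr'."""
    labels = classify_pages(page_texts, min_chars)
    return "ocr" if "image" in labels else "converter"
```

This only catches pages that are entirely images. A page whose body is real text but whose headings are images would still be labeled 'text', so a visual check of the converted file remains advisable.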

PDFs can also be “password protected”. If you do not have the password you cannot extract text from them. Character converters cannot process these files unless provided with the password. OCR converters can handle them perfectly, as they just “look at the pages”, not using their internal character coding.

Conversion problems

The conversion is not without problems, though. PDFs are very capricious creatures, indeed. They may come in all sorts of formats, from the usual letter-size page layout to complex (and big!) newspaper-like pages. I have often received “unfolded brochures” that contained four or more successive pages, one beside the other, forming one big PDF page, maybe 25 in. long. Often, Word cannot handle these unusual printing formats. Tables, columns and boxes all present problems. The conversion programs can do an admirable job of rendering these printed formats as visually attractive Word documents, but under the surface the converted files are nightmarish: they are full of strange styles, disparate measures, unconventional character and line spacing... just for the sake of mimicking as closely as possible the original document.

Often, if you try to process these documents with Trados or any other computer-aided translation package, the stage is set for disaster. It’s too much like a house of cards: nice to watch, but you cannot touch it. The unavoidable modifications that a translation entails will quite probably play havoc with these documents.

Is there no hope, then? Not entirely: in difficult cases, I resort to manual extraction. Also, some extraction programs offer a menu of layout options for the converted file, ranging from full recreation of the original appearance to plain text extraction. I often find a middle-of-the-road selection satisfactory: I keep some typographical features (such as headings, bold type and italics) and suppress everything else.

Regenerating the original PDF document

In most cases, a translator simply cannot (and should not) attempt this. I have seen some well-meaning attempts to deliver a “PDF-like” document, and they were failures. The client (or the agency) has to provide for proper DTP if necessary. And, believe me, the DTP person will not thank you for any complex formatting in your document. More often than not, the very first thing that a DTP specialist has to do is to eliminate all complex formatting from the translated file, in order to be able to lay it out again.

However, in some special cases, you may provide a PDF document in order to protect your work, or as a sample (see below).

Using PDF to your advantage

Regardless of the format of the document you receive, you can use the Acrobat format to your advantage. Few translators own the full Acrobat “maker” program because it is relatively expensive (US$299 for the “Standard” version from Adobe) and apparently there isn’t much use for it. Nowadays, there are many third-party PDF generators which, although they may not offer all the features of the Adobe product, are perfectly suitable for most jobs. Once again, you can find many of them by doing an Internet search on “PDF generator”. My personal favorite is PDF 995 at http://www.software995.com/, a very good program created by former Adobe employees.

And why would you want to generate PDFs? Mainly for invoicing. Invoices in PDF are difficult to tamper with, can be printed as originals, and are accepted by many businesses as equivalent to their printed counterparts. You can password-protect them so that they cannot be modified, and you can even sign them electronically. As with all electronic documents, they are easy to file and retrieve. You can also send quotes and bids in PDF format: Word documents or e-mail messages cannot give the kind of long-term security that a PDF offers.

For the same reason, in some cases you can also send PDF files as a backup for Office documents or as an easily viewed complement to original DTP files (such as Quark Xpress or InDesign). Many users do not have the means to open these files, but they can easily check the PDF printed from them.

Source: http://www.proz.com/translation-articles/articles/480/1/Handling-PDF-Documents