How to count and translate PDF files
Written by Marco Cevoli   
Saturday, 11 October 2008 19:00

When you must translate a PDF file, there are different options for converting it to an editable format, depending on its structure.

PDF files are among the translator's worst enemies. A PDF file must be converted into an editable format before it can be analysed or translated with a CAT program. Depending on the PDF type, the conversion can be more or less difficult, and sometimes impossible. Quickly identifying the PDF type is the prerequisite for choosing the most appropriate conversion process, selecting the best tools and saving time.

PDF stands for Portable Document Format. Adobe developed it in 1993 to represent two-dimensional documents independently of application software, hardware and operating system. In practice, a PDF looks and behaves the same on any computer. Thanks to these characteristics, PDF has become one of the preferred formats for sharing documents. For many people, creating a PDF version of a document is like making a "virtual photocopy". While this method is very handy, it has several disadvantages when the document needs to be modified or translated.

A PDF document is composed of different elements. Some of them, such as the document properties (author, title, etc.), are independent of the visible content. Others make up the page content and generally include text, bitmap images (pictures) and vector graphics (lines, diagrams, etc.).

It is important to determine whether the document we are viewing contains actual text, that is, text that can be selected. To find out, open the document with Adobe Reader (or any other PDF viewer) and click on the Select Text icon in the toolbar. Alternatively, zoom in on the document. If at some point the text appears out of focus or badly printed, the document is a scan. If, on the other hand, the text can be selected or its resolution does not deteriorate when zooming, the PDF was generated by an application. To identify which application was used to create the PDF, press CTRL+D (or select File / Document / Properties) and read the Description tab. Under Application you should see the name of the program used to generate the PDF.
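When many files have to be triaged, the same check can be roughly automated. The following Python sketch is only a heuristic, not a real PDF parser: it assumes the font, image and document information dictionaries are stored uncompressed in the file, which is often not true for recent PDF versions, and a scan that has already been through OCR will also contain fonts. The function names `guess_pdf_kind` and `read_info_field` are our own, chosen for illustration.

```python
import re
from typing import Optional


def guess_pdf_kind(data: bytes) -> str:
    """Roughly classify raw PDF bytes as 'text', 'scan' or 'unknown'.

    Heuristic: a PDF generated by an application normally declares fonts,
    while a plain scan contains image objects but no fonts.
    """
    has_fonts = re.search(rb"/Type\s*/Font\b", data) is not None
    has_images = re.search(rb"/Subtype\s*/Image\b", data) is not None
    if has_fonts:
        return "text"
    if has_images:
        return "scan"
    return "unknown"


def read_info_field(data: bytes, key: str) -> Optional[str]:
    """Return a simple literal string such as /Creator (...) if present.

    /Creator is what Adobe Reader shows as "Application". Strings with
    escaped parentheses or hexadecimal encoding are not handled here.
    """
    pattern = rb"/" + key.encode("ascii") + rb"\s*\(([^)]*)\)"
    match = re.search(pattern, data)
    return match.group(1).decode("latin-1") if match else None
```

Typical use would be `data = open("file.pdf", "rb").read()` followed by `guess_pdf_kind(data)` and `read_info_field(data, "Creator")`. A result of "unknown" simply means that the file has to be opened and checked by hand as described above.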

Ideally, at this point you should ask your client for the editable file, as you have just confirmed that it exists. A good way to persuade the client is to apply an extra charge for converting the PDF file. Obviously, the approach depends on your relationship with the client and on the specific project. To tell the truth, the client may not have the editable file, especially if it is a multinational organisation: DTP is often managed at headquarters, and local branches only receive PDF files for printing. The need for translation may arise only at this stage, in which case finding the original document can be very difficult.

If despite all efforts it is not possible to obtain the original file, there are various options for exporting the text. Be warned that none of the export options will produce a file identical to the original (including fonts), in particular when it contains bitmap images and certain types of formatting. The choice of export method, as well as the degree of accuracy required, also depends on the text's intended use. There are two possible situations:
  1. The text is only needed for a word count or analysis.
  2. The text must be editable and as close as possible to the original.
In the first case, exporting the file is not necessary. If the PDF contains actual text (as previously described), one of the following tools can be used. If the word count must be performed on a document smaller than 1 MB, no program needs to be installed: the count can be done using a free online tool.
If you cannot or prefer not to use this software, and you have Adobe Acrobat (not Adobe Reader), you can export the text as follows:
  • Open the PDF file with Adobe Acrobat
  • In the File menu, save the document as RTF or DOC
Depending on the document type, you may need to apply this Word macro to restore the correct paragraph structure and spacing.
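If the exported text ends up in a plain text file rather than in Word, the same clean-up and a rough word count can be sketched in a few lines of Python. This is a minimal illustration and not the macro mentioned above: it assumes that paragraphs are separated by blank lines, it does not rejoin words hyphenated across line breaks, and it counts every whitespace-separated token as a word, so the total will not match a CAT tool analysis exactly. The function names are ours.

```python
import re


def restore_paragraphs(text: str) -> str:
    """Join hard-wrapped lines, treating blank lines as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = [
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    ]
    return "\n\n".join(joined)


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (a rough word count)."""
    return len(re.findall(r"\S+", text))


raw = "PDF files are one of\nthe translator's worst\nenemies.\n\nA second\nparagraph."
clean = restore_paragraphs(raw)
print(clean)
print(count_words(clean))
```

For a quotation, a count of this kind is usually good enough; for a detailed analysis with repetitions and matches, the cleaned text still has to go through the CAT program.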

If you do not have Adobe Acrobat:
  • Open the file with Adobe Reader
  • Click on the Select Text tool
  • Select all the text (CTRL+A)
  • Copy (CTRL+C)
  • Open Word or any other text editor
  • Paste the text (CTRL+V)
Of course, this operation can also be performed on just a section of the document.

If you wish to preserve the format, there are two options: you can either use one of the several programs that convert PDF into Word files, or use an OCR program (FineReader, OmniPage, ReadIris, etc.). In general, we do not recommend using programs that convert automatically without the user's intervention. These programs usually generate Word documents that preserve the original PDF's appearance by means of very complex formatting: text frames, section breaks, columns, fonts and spacing. As soon as you start editing the document, for instance by deleting a sentence or opening the text with a CAT program, the formatting is lost and it becomes virtually impossible to work on the document. For these reasons, we recommend converting the document with an OCR program (the best we could find was Abbyy FineReader), manually modifying the default settings and choosing the distribution of the various elements on the page. For further information on how to use FineReader, please read the article «Optical character recognition with Abbyy FineReader».

If the file must not only keep its format but also be reproduced in full (and the source file is not available), there are two possibilities:

  1. Use a DTP program (Quark, InDesign, etc.), with the original PDF as a model in the background. For more information, we invite you to read this article: http://www.proz.com/translation-articles/articles/560/1/Translation-and-DTP-of-a-PDF-File
  2. Use Infix, a PDF editor distributed by Iceni.

 

Infix Professional (about $160) has a useful feature that exports the text content of a PDF in XML format. The resulting XML file can be processed and translated with a CAT tool (e.g. OmegaT, which since version 2.3.0 includes a filter that opens this file type directly, as described in this in-depth tutorial on the OmegaT website). Infix Professional can then reimport the translated file into the original PDF. The Infix website shows the whole procedure in a self-explanatory video.
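To give an idea of what such an export and reimport cycle involves, here is a small Python sketch of the general principle: read the text out of an XML file, translate it, and write it back into the same structure. We do not reproduce the actual Infix schema here; the element names in the sample (`doc`, `page`, `p`) are purely hypothetical, and the sketch ignores text that follows inline tags. In real projects this step is handled by the CAT tool's own filter.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def extract_segments(xml_text: str) -> List[str]:
    """Collect the non-empty text of every element, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter() if el.text and el.text.strip()]


def apply_translations(xml_text: str, translations: Dict[str, str]) -> str:
    """Replace each source segment that has a translation, keeping the tags."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        if el.text and el.text.strip() in translations:
            el.text = translations[el.text.strip()]
    return ET.tostring(root, encoding="unicode")


# Hypothetical export: the element names are invented for this example.
source = "<doc><page><p>Hello world</p><p>Second line</p></page></doc>"
print(extract_segments(source))
print(apply_translations(source, {"Hello world": "Ciao mondo"}))
```

The point of the round trip is that only the text changes, while the structure that the PDF editor needs in order to place the text back on the page stays untouched.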

Those who do not want to purchase an OCR program, or only need one occasionally, can use one of the many online converters, such as Zamzar (http://www.zamzar.com).

As already stated, what we have explained so far only applies to PDF files generated by an application. When the PDF text is an image (typically the case with a scanned fax), the only way to export it to an editable format is to use an OCR program.

Any document protection settings are an additional complication. Two protection levels can be activated, using a "user password" and an "owner password". The "user password" prevents the document from being opened. The "owner password" restricts access to one or more functions such as printing, copying, modifying, inserting notes, etc. If the PDF author has restricted access to functions using a password, the methods described above cannot be used. You must contact the client and ask for the password. If this is impossible, you should be aware that there are several tools to decipher "owner passwords". You only need to search for "PDF crack" on Google. You can also use online programs such as http://www.ensode.net/pdf-crack.jsf. The situation is more complicated when the "user password" prevents the PDF from opening; in this case the only option is intrusive software that may take hours or even days to decipher the password. Please note that the use of this software may infringe property rights. Qabiria does not promote its use by any means.
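Before contacting the client, it can help to know whether a file is protected at all and which functions are restricted. The sketch below only detects and describes protection; it does not remove it. It relies on two facts from the PDF specification: a protected file has an /Encrypt entry in its trailer, and the restrictions are stored as an integer (the /P value) in which bits 3 to 6 stand for printing, modifying, copying and annotating. The function names are ours, and the way the /P value is obtained from a given file is left out.

```python
from typing import Dict

# Bit masks for permission bits 3, 4, 5 and 6 of the /P value.
PERMISSION_BITS = {"print": 4, "modify": 8, "copy": 16, "annotate": 32}


def looks_encrypted(data: bytes) -> bool:
    """Heuristic: a protected PDF references an /Encrypt dictionary."""
    return b"/Encrypt" in data


def decode_permissions(p: int) -> Dict[str, bool]:
    """Map a /P permission value to the functions it allows."""
    return {name: bool(p & mask) for name, mask in PERMISSION_BITS.items()}


# Example value: printing and copying allowed, modifying and annotating not.
print(decode_permissions(-44))
```

If copying is not allowed, the copy and paste method and most converters will fail, so this is the moment to ask the client for the password or for an unprotected copy.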

Comments (8)

Useful and clear article. Thanks.
Paolo, February 11, 2009

Hi, I liked it a lot, even if it is a bit long... the external links are very interesting anyway.
elisa, February 13, 2009

Truly comprehensive and clear - thanks!
Elizabeth Hill, November 10, 2009

A clearly written and very thorough article. Congratulations. I think, however, that we translators should offer this conversion as an added service. There is a very interesting article on the subject on the website of some Australian translators (I'll see if I can find the link). Since these conversions involve a lot of work, we should make the client choose between receiving the text without formatting and paying for the conversion. How: on receiving the job, convert 1 or 2 pages and send them to the client, saying that for "x euros" more you can deliver the translation formatted almost identically to the original. When clients see the "x euros more", they cannot go looking for the source file fast enough. If they really want the conversion and do not have the original file, let them pay.
To my own shame, I have not practised what I preach, and I still convert documents without charging my clients for it.
Michael Martí, January 24, 2010

Thanks, Michael. Indeed, very often the mere mention of a "conversion surcharge" makes the source files that generated the PDF appear out of nowhere...
Marco Cevoli, January 24, 2010

For free you can use gDoc Creator to convert pdf files to word. One of the convert to Word options in the software is to retain text flow so that it is easily editable. It may be of use to you and I would be interested in your comments about it. Here's a link to the product page: http://bit.ly/5SFT2h
Graeme Huttley, February 04, 2010

Thanks a lot for sharing the information, Graeme. Actually, there are dozens of programs that claim to easily convert from PDF to Word. However, the scope of this article is just the opposite. We weren't looking for a "quick and dirty" solution, but for the best way of producing a Word document while keeping control of the format during the conversion. From our experience, the only way to achieve this is using the advanced features of plain OCR software, not out-of-the-box solutions.
Marco Cevoli, February 04, 2010

Has anyone tried Infix for searchable PDFs? http://www.iceni.com/infix-Translate.htm

Just wondering...
gbcuxknu, April 19, 2011
