Note: This post is outdated. See the 2018 version here.
I’m developing a paperless workflow for my home and office. I want to save all my documents in PDF/A-1b archival format so I will be able to open them for years to come. The PDFs should be searchable, meaning they contain not only images of documents, but strings of text. This allows the documents to be indexed so I can quickly find documents when I type in Windows Explorer’s search box.
There are basically three types of documents that need to be archived:
- Paper documents. These must be scanned and, in order for them to be searchable, have Optical Character Recognition (OCR) applied. I’ve found OmniPage 18 Standard to be pretty good at this, except for the annoying bug that white-on-black text (often used in column headings of printed documents) disappears.
- Non-PDF electronic documents like emails, web pages, etc. These already have text; they just need to be converted to PDF/A. I’ve already blogged about using CutePDF to print these to PDF/A.
- PDF documents. Once you opt out of paper statements, your bank, credit card company, telephone company, and utility will give you links to PDF files for download. Your tax software probably saved a PDF file too. You could re-print these to PDF/A using CutePDF, but I chose to write a batch file to quickly convert an existing PDF to PDF/A using Ghostscript. This batch process is the subject of this article.
Set up the Batch Components
Caveat: This approach should create valid PDF/A documents, but even among experts, there is some disagreement about the PDF/A standard. Use this approach at your own risk. If you have Adobe Acrobat Professional, you can use its “pre-flight” validation to check the output. Or you may want to try a free online validator like the one at PDF-Tools.com or the one at intarsys.de (German). For more background on the process, see this superuser article and this Ghostscript bug report.
The underlying technology for this batch file is the same as for the CutePDF process, so if you have already followed the other post, you can skip the identical steps.
Update August 15, 2017: The pdfa.cmd batch file has been updated to better handle installation as a Send To extension in Windows Explorer. The new batch file is included in the PDFAbatch_1.2.zip download below. See the Release Notes in the batch file for more details.
1. Download the GNU Affero-licensed version of Ghostscript 9.20 here. I found that the 32-bit version works fine even under 64-bit Windows 7. Install Ghostscript to the default directory, C:\Program Files (x86)\gs\gs9.20. At the end of the install, go ahead and let it Generate cidfmap for Windows CJK TrueType fonts.
Note: It should be fine to use a later version of Ghostscript; you’ll just need to modify the gs_path variable at the top of the pdfa.cmd file.
2. Create an empty folder on your C: drive called C:\GS_PDFA (Ghostscript PDF/A).
3. Go to Control Panel > System and Security > System. Click on Advanced system settings. Add C:\GS_PDFA to the end of the Path statement (System environment variable).
4. Download PDFAbatch_1.2.zip and unzip it into C:\GS_PDFA. This will give you three files:
- pdfa.cmd – the batch file
- PDFA_def.ps – the prefix file for Ghostscript conversion to PDF/A
- PDF_ShowBookmarksPanel.ps – a PostScript instruction to tell a PDF reader to show the Bookmarks Panel when opening the document
Note that PDFA_def.ps is the same file described in the CutePDF post, so it’s okay to overwrite it.
5. Locate the path to Ghostscript’s gswin32c.exe on your system. pdfa.cmd assumes it is in C:\Program Files (x86)\gs\gs9.20\bin\. If it is somewhere else, e.g. if you have a different version of Ghostscript, update line 60 of pdfa.cmd to point to the correct path.
6. Download the Adobe ICC profiles here. An ICC profile describes a “color space.” We’ll use the simplest one, Adobe RGB (1998). From the downloaded zip archive, extract AdobeRGB1998.icc to the C:\GS_PDFA folder. Again, this is the same file used in the CutePDF post so it’s okay to overwrite it. (You can use a different profile, e.g. sRGB_IEC61966-2-1_no_black_scaling.icc from www.color.org; you’ll need to modify PDFA_def.ps accordingly.)
That’s it! You’re now ready to convert PDF files to PDF/A.
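To double-check the setup before the first conversion, a small script can confirm that the pieces from the steps above are in place. The following is a minimal Python sketch, not part of the download; it assumes Python 3 is installed, and the function names are made up for illustration. The folder, file names, and Ghostscript path are the defaults used in this post.

```python
import os
from pathlib import Path

# Defaults taken from the steps above; adjust if you installed elsewhere.
PDFA_FOLDER = r"C:\GS_PDFA"
GS_EXE = r"C:\Program Files (x86)\gs\gs9.20\bin\gswin32c.exe"
REQUIRED_FILES = [
    "pdfa.cmd",
    "PDFA_def.ps",
    "PDF_ShowBookmarksPanel.ps",
    "AdobeRGB1998.icc",
]


def missing_files(folder, names=REQUIRED_FILES):
    """Return the names from `names` that are not present in `folder`."""
    return [name for name in names if not (Path(folder) / name).is_file()]


def folder_on_path(folder, path_value, sep=";"):
    """Return True if `folder` is one of the entries in a Path string."""
    wanted = folder.rstrip("\\/").lower()
    entries = [entry.strip().rstrip("\\/").lower() for entry in path_value.split(sep)]
    return wanted in entries


if __name__ == "__main__":
    # Report anything that looks wrong; print nothing if the setup is complete.
    for name in missing_files(PDFA_FOLDER):
        print(f"Missing from {PDFA_FOLDER}: {name}")
    if not Path(GS_EXE).is_file():
        print(f"Ghostscript not found at {GS_EXE}")
    if not folder_on_path(PDFA_FOLDER, os.environ.get("PATH", "")):
        print(f"{PDFA_FOLDER} is not on the Path")
```

If the script prints nothing, the files and the Path entry are where this post expects them. It does not check that the files themselves are valid.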
Use the Batch File
Since the batch file is in your path, you should be able to open a command prompt anywhere on your system, type pdfa <filename>, and watch it convert the file to PDF/A. Some notes and advanced usage:
- Do not type the .pdf extension on the input parameters. Just type the file name.
- If the file name contains spaces, enclose it in quotation marks.
- The batch program will rename the input file to .old.pdf and create the PDF/A as .pdf. You can delete the .old.pdf file(s) if you are satisfied with the new PDF/A document.
- You can concatenate up to five input PDFs into one output PDF/A. Separate the input file names with spaces.
- When conversion finishes, the PDF/A output file will open in the program on your computer that is registered for viewing PDF files (e.g. Adobe Reader).
- To set the Initial View of the PDF to show the Bookmarks (outline) panel, set the last parameter to -sb (show bookmarks). The input file must already contain bookmarks. Bookmarks will not work properly when concatenating files because bookmarks copied from later files will point to incorrect page numbers.
- Type pdfa by itself to see some usage notes.
Usage
pdfa file1 [file2|-sb] [file3|-sb] [file4|-sb] [file5|-sb]
Usage Examples
1. If you have a PDF utility bill, open a command prompt where the PDF file resides and use this command:
pdfa "Utility Bill"
Output
Utility Bill.pdf – the PDF/A document
Utility Bill.old.pdf – the original PDF document
2. If you have a credit card statement with two reconciliation reports to attach, use the following command:
pdfa CCstatement recon1 recon2
Output
CCstatement.pdf – the combined PDF/A document
CCstatement.old.pdf
recon1.old.pdf
recon2.old.pdf
3. If you have a tax return that includes bookmarks, use the following command:
pdfa "Tax Return" -sb
Output
Tax Return.pdf – the PDF/A document, should open with bookmarks panel
Tax Return.old.pdf
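The naming rules in the notes and examples above can be summarized in a few lines. This is a Python sketch of the documented behavior only, not a translation of pdfa.cmd. `plan_conversion` is a hypothetical helper, and it assumes that -sb is recognized only as the last of at most five parameters, as the usage line suggests.

```python
def plan_conversion(args):
    """Work out the files pdfa would read and write for a list of parameters.

    Follows the behavior described in the post: inputs are given without
    the .pdf extension, the output is named after the first input, and
    each input is kept as <name>.old.pdf.
    """
    args = list(args)
    if not 1 <= len(args) <= 5:
        raise ValueError("expected one to five parameters")
    # -sb only counts when it is the last parameter.
    show_bookmarks = args[-1] == "-sb"
    inputs = args[:-1] if show_bookmarks else args
    if not inputs:
        raise ValueError("at least one input file is required")
    return {
        "output": f"{inputs[0]}.pdf",
        "originals": [f"{name}.old.pdf" for name in inputs],
        "show_bookmarks": show_bookmarks,
    }


# The second and third usage examples from the post.
print(plan_conversion(["CCstatement", "recon1", "recon2"]))
print(plan_conversion(["Tax Return", "-sb"]))
```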
Update November 22, 2016:
Add a File Explorer Context Menu
I use this so much that I needed a way to run the batch directly from File Explorer without having to open a command prompt. This turns out to be pretty simple to set up.
1. In File Explorer, go to %AppData%\Microsoft\Windows\SendTo.
2. Add a shortcut to C:\GS_PDFA\pdfa.cmd. Name it “PDFA Batch File”. (While you’re here, you might want to remove Send To items that you’ll never use.)
3. Now, in File Explorer, Ctrl-click to select up to five PDF documents in the order in which you want to concatenate them. Right-click on the first one and choose Send to > PDFA Batch File.
A command window will appear briefly as it converts the file(s), then the completed file will open in your default PDF viewer.
Thanks for providing this.
Though I use Linux to create PDF/A I was FINALLY able to perform valid PDF/A with the magic of the long line. No more “the value of the key N is 4 but must be 3” error
By converting the PDF “manually” via your tutorial, the PDF is converted to PDF/A. However, if I use the “Send To” shortcut, a window opens and closes really quickly, but the file is untouched. Any thoughts?
@Luuk, try adding the word “pause” as the last line in the batch file. That should force it to leave the black window on the screen when it finishes. Then you can examine the messages and see what’s wrong. I’m thinking maybe the program is not accessible when running from the context menu…
@Mark,
Thank you for your reply. I added “pause” in the batch file, and now the window stays open. It says: “C:\*path_to_pdf*\filename.pdf.pdf” not found. Exiting.
I don’t know why it shows the file extension twice. If I recall correctly, when converting the files manually with the tutorial above, I should leave off the file extension; could this be the reason? Any ideas on how to fix this?
@Luuk, I found that I had made, but not yet published, some enhancements related to Send To handling, specifically stripping the file extension before appending .pdf. I’ve published the updated file above in PDFAbatch_1.2.zip. Note that you’ll need to edit line 60 to point to whatever version of Ghostscript you have installed (I’m now on 9.20). Does that fix it?
Hi Mark,
Thank you, now it works! I was wondering, would you be able to help me with the following? If I select a maximum of 5 PDFs, it adds all the PDFs to one new file. I would prefer to select more files, let’s say a maximum of 15, and convert them to PDF/A, saving each one as a separate file. Is this possible?
Luuk, glad it’s working now. The script as written concatenates PDFs so no, you can’t use it to select many files and have a separate PDF for each file. If you have a lot of files, you could write your own script to call this script repeatedly, e.g. (untested):
call C:\GS_PDFA\pdfa.cmd C:\path\file1.pdf
call C:\GS_PDFA\pdfa.cmd C:\another.path\file2.pdf
…
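The same idea, one call to pdfa.cmd per file, can also be scripted for a whole folder. Here is a minimal sketch in Python, equally untested against the real batch file. It assumes Python 3 on Windows, the C:\GS_PDFA\pdfa.cmd location from this post, and the 1.2 version of the batch file, which strips the .pdf extension from its input. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path

PDFA_CMD = r"C:\GS_PDFA\pdfa.cmd"  # location used in this post


def pdfs_to_convert(folder):
    """List the PDFs in `folder`, skipping .old.pdf backups from earlier runs."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file()
        and path.suffix.lower() == ".pdf"
        and not path.name.lower().endswith(".old.pdf")
    )


def convert_folder(folder, pdfa_cmd=PDFA_CMD):
    """Call pdfa.cmd once per PDF so each file gets its own PDF/A output."""
    # The list is built before any conversion starts, so the .old.pdf
    # files created along the way are not picked up again.
    for pdf in pdfs_to_convert(folder):
        # cmd /c runs the batch file; one input per call avoids concatenation.
        subprocess.run(["cmd", "/c", pdfa_cmd, str(pdf)], check=True)
```

Calling `convert_folder(r"C:\path")` would then convert every PDF in that folder. Because the batch file opens each finished PDF/A in the default viewer, expect one viewer window per file.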
Mark, your post has been very helpful: The US Patent Office requires uploading PDF copies of foreign patent documents, and rejects any that use non-embedded fonts. Finding a way to quickly turn non-compliant PDF copies into compliant ones has long been on my to-do list, and your post was inspiring. Here is a modified version of your script that I am using which is a bit more optimized for my right-click-send-to use case.
@echo off
REM Assumes the default Ghostscript location from the post; adjust as needed.
set "gs_path=C:\Program Files (x86)\gs\gs9.20\bin"
echo.
echo Converting selected PDF files in directory to PDF/A
echo.
for %%i in (%*) do (
    echo Processing file %%i with extension %%~xi
    IF /I "%%~xi" == ".pdf" (
        REM Keep the original as _old_<name> in the same folder.
        ren "%%~fi" "_old_%%~nxi"
        REM Convert to PDF/A, writing the output under the original file name.
        "%gs_path%\gswin32c" ^
            -dPDFA ^
            -dNOOUTERSAVE ^
            -sProcessColorModel=DeviceRGB ^
            -sDEVICE=pdfwrite ^
            -o "%%~fi" ^
            -dPDFACompatibilityPolicy=1 ^
            "C:\Program Files (x86)\gs\PDFA_def.ps" ^
            "%%~dpi_old_%%~nxi"
    )
)
:END
exit
Axel – thanks for sharing!
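For readers working outside of batch files, the Ghostscript call in the script above can be assembled as an argument list. The sketch below is in Python and reuses only the switches that appear in that script. `build_gs_command` and its parameter names are hypothetical. The PDF/A level is written in Ghostscript's documented -dPDFA=n form, whereas the script uses the bare -dPDFA. Whether level 2 works depends on the Ghostscript version installed.

```python
def build_gs_command(gs_exe, pdfa_def, source, output, pdfa_level=1):
    """Return the argument list for a Ghostscript PDF to PDF/A conversion."""
    if pdfa_level not in (1, 2):
        raise ValueError("pdfa_level must be 1 or 2")
    return [
        gs_exe,
        f"-dPDFA={pdfa_level}",
        "-dNOOUTERSAVE",
        "-sProcessColorModel=DeviceRGB",
        "-sDEVICE=pdfwrite",
        "-o", output,
        "-dPDFACompatibilityPolicy=1",
        pdfa_def,  # prefix file; listed before the input PDF, as in the script
        source,
    ]


# Example with placeholder file names; pass the list to subprocess.run to execute it.
print(build_gs_command("gswin32c", "PDFA_def.ps", "_old_bill.pdf", "bill.pdf"))
```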
Mark,
Just performed the complete install, works like a charm, no problems whatsoever.
My knowledge of scripting is very poor, but I would like to convert a full folder to PDF/A (using the Send To function).
When I try it now, it deletes the original files (always the first one), but it does not work on more than one file.
Most likely this is because it concatenates the PDFs.
Could you help me solve this?
Thanks in advance
Luc Bodson
Hi Luc, glad you got it working. Sorry, it’s not designed to work on folders. You could try writing a script that calls pdfa.cmd for each file in the folder (see my August 16, 2017 comment above). I also offer custom programming on a consulting basis.
Thanks Mark, for the info, hope to get the batch routine working for folder work.
Take care,
Luke
Hi Luc,
In the meantime I managed to develop a script that converts each PDF file in a specified folder to PDF/A and copies it to another specified folder. Feel free to contact me at [email removed] so I can send you the script and some instructions.
Thanks Luuk,
I just sent you a PM.
Best regards
Luc
Glad you guys connected. Now that you have, Luuk, I’m removing your email address from your comment to keep it away from spam bots.
I finally figured out what has been going on with PDF/A conversion: semi-transparent fonts were getting rasterized, causing fonts to go fuzzy and files to lose searchability and bloat in size. Recent versions of Ghostscript can handle PDF/A-2b, which supports transparency, which solves all that. See the rewritten post here:
https://www.mcbsys.com/blog/2018/10/batch-convert-pdf-to-pdf-a-2018-edition/