This is a major revision of an article that I wrote in 2013. I’m developing a paperless workflow for my home and office. I want to save all my documents in PDF/A-2b archival format so I will be able to open them for years to come. The PDFs should be searchable, meaning they contain not only images of documents, but strings of text. This allows the documents to be indexed so I can quickly find documents when I type in Windows Explorer’s search box.
Note: The 2013 approach created PDF/A-1b files, which do not support transparency. Increasingly, files that I receive include transparent fonts; when these were converted to PDF/A-1b, the fonts were rasterized, the files were no longer searchable, and the files were very large. PDF/A-2b, with its support for transparency, solves all that.
There are basically three types of documents that need to be archived:
- Paper documents. These must be scanned and, in order for them to be searchable, have Optical Character Recognition (OCR) applied. I’ve found OmniPage 18 Standard to be pretty good at this.
- Non-PDF electronic documents like emails, web pages, etc. These already have text; they just need to be converted to PDF/A. I’ve already blogged about using CutePDF to print these to PDF/A.
- PDF documents. Once you opt out of paper statements, your bank, credit card company, telephone company, and utility will give you links to PDF files for download. Your tax software probably saved a PDF file too. You could re-print these to PDF/A using CutePDF, but I chose to write a batch file to quickly convert an existing PDF to PDF/A using Ghostscript. This batch process is the subject of this article.
Set up the Batch Components
Caveat: This approach should create valid PDF/A documents, but even among experts, there is some disagreement about the PDF/A standard. Use this approach at your own risk. If you have Adobe Acrobat Professional, you can use its “pre-flight” validation to check the output. Or you may want to try a free online validator like the one at PDF-Tools.com. For more background on the process, see this superuser article and this Ghostscript bug report. In brief testing with PDF/A-2b, I found that one of three test files failed validation, complaining about CMYK colorspace even though I specified RGB. Maybe the source document (an AT&T Internet bill) had some reference to CMYK.
The underlying technology for this batch file is the same as for the CutePDF process, so if you have already followed the other post, you can skip the identical steps.
1. Download the latest GNU Affero-licensed version of Ghostscript here (version 9.25 as of the original writing; version 9.54 as of April 22, 2021). I found that the 32-bit version works fine even under 64-bit Windows 7 or 10. Install Ghostscript but customize the directory so it doesn’t change if you get a later version; I use C:\Program Files (x86)\gs\latest. At the end of the install, go ahead and let it Generate cidfmap for Windows CJK TrueType fonts.
2. Create an empty folder on your C: drive called C:\GS_PDFA (Ghostscript PDF/A).
3. Go to Control Panel > System and Security > System. Click on Advanced system settings then Environment Variables. Under System variables, highlight Path, click Edit and add C:\GS_PDFA to the Path.
4. Download PDFAbatch_1.5.zip and unzip it into C:\GS_PDFA. This will give you four files:
- pdfa.cmd – the batch file
- PDFA_def.ps – the prefix file for Ghostscript conversion to PDF/A
- PDF_ShowBookmarksPanel.ps – a PostScript instruction to tell a PDF reader to show the Bookmarks Panel when opening the document
- Release Notes.txt
Note that PDFA_def.ps is the same file described in the CutePDF post, so it’s okay to overwrite it.
Update April 22, 2021: Some updates for compatibility with Ghostscript 9.54. See Release Notes.txt.
5. Locate the path to Ghostscript’s gswin32c.exe on your system. pdfa.cmd assumes it is in C:\Program Files (x86)\gs\latest\bin\. If it is somewhere else, update line 66 of pdfa.cmd to point to the correct path.
6. Download the Adobe ICC profiles here. An ICC profile describes a “color space.” We’ll use the simplest one, Adobe RGB (1998). From the downloaded zip archive, extract AdobeRGB1998.icc to the C:\GS_PDFA folder. Again, this is the same file used in the CutePDF post so it’s okay to overwrite it. (You can use a different profile, e.g. sRGB_IEC61966-2-1_no black_scaling.icc from www.color.org; you’ll need to modify PDFA_def.ps accordingly.)
That’s it! You’re now ready to convert PDF files to PDF/A.
Use the Batch File
Since the batch file is in your path, you should be able to open a command prompt anywhere on your system, type pdfa <filename>, and watch it convert the file to PDF/A. Some notes and advanced usage:
- Do not type the .pdf extension on the input parameters. Just type the file name.
- If the file name contains spaces, enclose it in quotation marks.
- The batch program will rename the input file to .old.pdf and create the PDF/A as .pdf. You can delete the .old.pdf file(s) if you are satisfied with the new PDF/A document.
- You can concatenate up to five input PDFs into one output PDF/A. Separate the input file names with spaces.
- When conversion finishes, the PDF/A output file will open in the program on your computer that is registered for viewing PDF files (e.g. Adobe Reader).
- To set the Initial View of the PDF to show the Bookmarks (outline) panel, set the last parameter to -sb (show bookmarks). The input file must already contain bookmarks. Bookmarks will not work properly when concatenating files because bookmarks copied from later files will point to incorrect page numbers.
- Type pdfa by itself to see some usage notes.
Usage
pdfa file1 [file2|-sb] [file3|-sb] [file4|-sb] [file5|-sb]
Usage Examples
1. If you have a PDF utility bill, open a command prompt where the PDF file resides and use this command:
pdfa "Utility Bill"
Output
Utility Bill.pdf – the PDF/A document
Utility Bill.old.pdf – the original PDF document
2. If you have a credit card statement with two reconciliation reports to attach, use the following command:
pdfa CCstatement recon1 recon2
Output
CCstatement.pdf – the combined PDF/A document
CCstatement.old.pdf
recon1.old.pdf
recon2.old.pdf
3. If you have a tax return that includes bookmarks, use the following command:
pdfa "Tax Return" -sb
Output
Tax Return.pdf – the PDF/A document, should open with bookmarks panel in Adobe Reader
Tax Return.old.pdf
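The naming rules shown in the examples above can be summarized in a few lines of code. This is only a minimal Python sketch of the behavior described in this article, not the logic inside pdfa.cmd itself; the function name plan_conversion and the dictionary keys are made up for illustration.

```python
def plan_conversion(args):
    """Describe which files a pdfa call would produce, per the rules above.

    args is the list of command-line parameters without the .pdf extension,
    optionally ending in -sb. (Hypothetical helper, for illustration only.)
    """
    args = list(args)
    # -sb is only meaningful as the last parameter.
    show_bookmarks = bool(args) and args[-1] == "-sb"
    names = args[:-1] if show_bookmarks else args
    if not 1 <= len(names) <= 5:
        raise ValueError("expected one to five input file names")
    return {
        # The PDF/A output takes the name of the first input file.
        "output": names[0] + ".pdf",
        # Every input is kept under an .old.pdf name.
        "originals": [name + ".old.pdf" for name in names],
        "show_bookmarks": show_bookmarks,
    }


if __name__ == "__main__":
    print(plan_conversion(["CCstatement", "recon1", "recon2"]))
```

For the credit card example, the sketch reports CCstatement.pdf as the output and three .old.pdf originals, matching the listing above.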
Add a File Explorer Context Menu
I use this so much that I needed a way to run the batch directly from File Explorer without having to open a command prompt. This turns out to be pretty simple to set up.
1. In File Explorer, go to %AppData%\Microsoft\Windows\SendTo.
2. Add a shortcut to C:\GS_PDFA\pdfa.cmd. Name it “PDFA Batch File”. (While you’re here, you might want to remove Send To items that you’ll never use.)
3. Now, in File Explorer, Ctrl-click to select up to five PDF documents in the order in which you want to concatenate them. Right-click on the first one and choose Send to > PDFA Batch File.
A command window will appear briefly as it converts the file(s), then the completed file will open in your default PDF viewer.
Reference
A few notes for future reference:
Official document on creating PDF/A:
https://www.ghostscript.com/doc/current/Ps2pdf.htm#PDFA
Notes on parameters to use for creating PDF/A:
https://bugs.ghostscript.com/show_bug.cgi?id=699582#c2
Notes on why transparent fonts produce non-searchable PDFs:
https://bugs.ghostscript.com/show_bug.cgi?id=692773#c3
In general, the best way to see if a Ghostscript problem has been reported and solved is to search the bug tracker at https://bugs.ghostscript.com/query.cgi. Change the Status to All to see open and closed bugs. Then search for the string “PDFA” (no slash). You can sort the results by Change date to see the more recent issues.
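For orientation, here is the general shape of a Ghostscript PDF/A-2b command line, following the official documentation linked above. This is a minimal Python sketch that only assembles the argument list; it is not a copy of what pdfa.cmd actually runs, and the function name build_gs_args and the default paths are assumptions taken from the folder layout described in this article.

```python
def build_gs_args(inputs, output,
                  gs_exe=r"C:\Program Files (x86)\gs\latest\bin\gswin32c.exe",
                  pdfa_def=r"C:\GS_PDFA\PDFA_def.ps"):
    """Assemble a Ghostscript argument list for PDF/A-2b output.

    Illustrative only: the real pdfa.cmd may pass additional or
    different options.
    """
    return [
        gs_exe,
        "-dPDFA=2",                       # request PDF/A-2
        "-sDEVICE=pdfwrite",              # write a PDF
        "-sColorConversionStrategy=RGB",  # convert colors to RGB
        "-dPDFACompatibilityPolicy=1",    # drop features PDF/A does not allow
        "-o", output,                     # output file; implies batch mode
        pdfa_def,                         # prefix file, must precede the inputs
        *inputs,                          # one or more source PDFs, in order
    ]


if __name__ == "__main__":
    print(build_gs_args(["Utility Bill.old.pdf"], "Utility Bill.pdf"))
```

To try it, the resulting list could be passed to subprocess.run on a machine where Ghostscript is installed. Recent Ghostscript releases restrict file access by default, so the ICC profile referenced by PDFA_def.ps may need to be explicitly permitted; check Release Notes.txt and the Ghostscript documentation for your version.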
I am getting the error below. Do you know why?
'pdfa' is not recognized as an internal or external command,
operable program or batch file.
Dilip, it sounds like the pdfa.cmd file is not in your path. See step 3 about adding the folder where you created pdfa.cmd to your computer’s path. You can also test pdfa.cmd directly by opening a command prompt and navigating to the folder where pdfa.cmd is located.
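One quick way to check whether the folder really made it into the path is to compare it against each entry of the PATH variable. The following is a minimal Python sketch; the helper name folder_on_path is hypothetical, and it assumes the Windows convention of semicolon-separated, case-insensitive entries.

```python
import os


def folder_on_path(folder, path_value=None, sep=";"):
    """Return True if folder appears as an entry in a PATH-style string.

    Hypothetical helper: compares entries case-insensitively and ignores
    surrounding quotes and trailing slashes, as Windows does.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def normalize(entry):
        return entry.strip().strip('"').rstrip("\\/").lower()

    target = normalize(folder)
    return any(normalize(entry) == target
               for entry in path_value.split(sep) if entry.strip())


if __name__ == "__main__":
    print(folder_on_path(r"C:\GS_PDFA"))
```

Keep in mind that a change to the path only reaches command prompts opened after the change, so close and reopen the command window before testing.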
Dear sir,
I’m using with success the method you describe in:
https://www.mcbsys.com/blog/2013/04/batch-convert-pdf-to-pdfa/
But when I try to use your updated version at:
https://www.mcbsys.com/blog/2018/10/batch-convert-pdf-to-pdf-a-2018-edition/
I can’t find the file C:\GS_PDFA\PDFWrite.pdfa.rsp, nor can I see the part of the script where you generate it.
Any suggestion to solve this problem?
Many thanks in advance.
Gusi – thanks for pointing that out. PDFWrite.pdfa.rsp is only needed by the CutePDF converter. I think I was trying to incorporate that into pdfa.cmd but it didn’t work. I’ve removed PDFWrite.pdfa.rsp from pdfa.cmd and uploaded a new version, PDFAbatch_1.4, now linked above.
Hi
Thanks for the instruction. It works flawlessly. Just wanted to ask if it’s possible to change the output directory?
Thanks
@Marco, changing the output directory is not a native feature of the script. From a quick glance, it looks like you could customize the “-o” (output) parameter where it runs gswin32c near the end of the script. You could try something like this:
-o "C:\MyPDFs\%file1%.pdf" ^
Hi Mark
Great script, thank you very much! On my Windows 10 machine this works up to Version 9.26 of Ghostscript, with 9.27 the conversion fails returning the following error message
Unrecoverable error: rangecheck in .putdeviceprops
— Full output: —
B:\fix_pdfs>pdfa P1_1_3_007_055
ren "P1_1_3_007_055.pdf" "P1_1_3_007_055.old.pdf"
GPL Ghostscript 9.27 (2019-04-04)
Copyright (C) 2018 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Unrecoverable error: rangecheck in .putdeviceprops
Die Datei "P1_1_3_007_055.pdf" kann nicht gefunden werden. [The file "P1_1_3_007_055.pdf" cannot be found.]
B:\fix_pdfs>
—
Regards,
Oliver
Oliver – Thanks for the report. Not sure what it means! Let us know if you figure it out.
try to use the whole path not just the filename
Thank you for your replies. I just tried and had no luck using it with GPL Ghostscript 9.50 (2019-10-15), whole file path as argument or no.
Hi Oliver – googling “rangecheck in .putdeviceprops”, I see that it has caused errors in the past, related to the -sOutputFile parameter. (I use the -o option which includes the output file specification https://www.ghostscript.com/doc/current/Use.htm#o_option.)
I noticed you are using a less-common B: drive. Do you have a drive C: or higher to try it on? Sometimes these kinds of things come down to unforeseen scenarios that were not tested by developers.
The next step would be to pull out the actual command from the pdfa.cmd script (around line 235) and try it. If you can duplicate the error using gswin32c, you could raise a bug directly with Ghostscript: https://bugs.ghostscript.com/.
I am now able to print a pdf/a-1b file with CutePDF using gs 9.25 with your scripts. Thank you very much. Can pdfa.cmd and the CutePDF scripts be modified for custom metadata such as: title, author, subject, keywords?
I tried manually editing the metadata and fixing with pdfa.cmd to pass validation, but the file becomes unsearchable for text because xref is messed up.
@Robert – mucking with the metadata is beyond anything I’ve ever attempted, so I’m afraid I can’t advise you on that. Good luck!
When using an option to save with orc the pdf / A becomes invalid. Any solution?
@Anderson, I’ve never heard of “orc” so I have no suggestions.
Thank you for posting this and keeping it updated/responding so diligently! I have it working fine for single files but am having trouble with selecting multiple files. I set up the shortcut that you recommended and when I multi-select it renames everything to .old but only populates one of the files selected (the most recently selected). Any idea what might be happening here?
Thanks, B
@B – in case it’s not clear, the expected behavior on multi-select is that it _concatenates_ all the selected files into _one_ PDF. For example, when I balance my bank account, I print a reconciliation report to PDF. Then I concatenate the bank statement and the reconciliation report to keep them together. Is that the behavior you are seeing?
@Mark Ah, I see now. Clearly I didn’t read thoroughly or look at my output document closely! It’s working perfectly then. I was looking to batch convert into individual files (uploading research articles into a public repository) but your code is pretty clear (even to a non-programmer) and I love the “send to” integration. It will be fun to see if I can figure out how to modify it for my purposes. Thanks again for your diligence.
You could write a batch file that calls pdfa multiple times, e.g.
call pdfa.cmd document1
call pdfa.cmd document2
call pdfa.cmd document3
(Pretty sure you want the “call”–try it.)
If you have a lot of documents, a couple of options:
1. Run “dir /b > convertfiles.cmd” at a command prompt to get a “bare” list of files. Then edit convertfiles.cmd, using copy and paste (or search and replace) to insert “call pdfa.cmd ” before each one.
2. Write a batch or PowerShell file to loop through all the PDF documents in a folder, calling pdfa on each one.
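The second option could also be sketched in Python rather than batch or PowerShell. This is an untested-on-Windows illustration, not part of the download: the function names stems_to_convert and convert_all are made up, and it assumes pdfa.cmd is on the path as described in step 3. The file list is collected before any conversion starts, so the .old.pdf files and new outputs created along the way are not picked up again.

```python
from pathlib import Path
import subprocess


def stems_to_convert(folder):
    """Return base names (no .pdf extension) of PDFs in folder.

    Skips *.old.pdf files, which pdfa leaves behind as backups.
    """
    stems = []
    for path in sorted(Path(folder).iterdir()):
        name = path.name.lower()
        if path.is_file() and name.endswith(".pdf") and not name.endswith(".old.pdf"):
            stems.append(path.name[:-4])  # drop the 4-character extension
    return stems


def convert_all(folder, dry_run=True):
    """Build one pdfa call per PDF; run them only if dry_run is False."""
    commands = [["cmd", "/c", "pdfa", stem] for stem in stems_to_convert(folder)]
    if not dry_run:
        for command in commands:
            # Run in the folder so pdfa finds the file by bare name.
            subprocess.run(command, cwd=folder)
    return commands


if __name__ == "__main__":
    for command in convert_all("."):
        print(command)
```

By default the sketch only lists the commands it would run. Since pdfa opens each finished file in the PDF viewer, converting a large folder this way will open many viewer windows.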
Hi,
Batch Convert PDF is absolutely great!
I am trying to change the limit of 5 PDFs for PDF/A output, but I am a bad programmer…
…for my use, I would need to convert max 10 pdf to 10 pdf/A…
@Michal, sorry I don’t have a batch file for doing 10 at once. What I have done in the past is to do multiple passes, e.g. convert documents 1-5 (save as 1.pdf), then do 1, 6, 7, 8, and 9, which concatenates the last four to the new 1.pdf.
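The multiple-pass idea generalizes to any number of files: the first pass takes up to five inputs, and each later pass takes the combined result plus up to four more. Here is a minimal Python sketch that only plans the passes; the function name plan_passes is hypothetical and nothing here calls pdfa.

```python
def plan_passes(names, limit=5):
    """Split a list of input names into successive pdfa calls.

    The first pass uses up to `limit` names; each later pass starts with
    the first name (now the combined PDF/A) followed by up to limit - 1 more.
    """
    if limit < 2:
        raise ValueError("limit must be at least 2")
    names = list(names)
    if not names:
        return []
    passes = [names[:limit]]
    rest = names[limit:]
    step = limit - 1
    for start in range(0, len(rest), step):
        passes.append([names[0]] + rest[start:start + step])
    return passes


if __name__ == "__main__":
    for args in plan_passes([str(n) for n in range(1, 10)]):
        print("pdfa", " ".join(args))
```

For nine documents this yields the two passes described above. Each pass renames its inputs to .old.pdf, so the .old.pdf left by an earlier pass may need to be moved aside before the next one; how pdfa.cmd handles an existing .old.pdf is not verified here.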
Hello Mark,
I modified the CMD a bit (the code is not pretty) and the conversion works for max. 9 files. Creates one output PDF/A for one input file.
I tried 10+ files and CMD doesn’t work (1 x 10 problem? X vs. XX value?).
Converting to PDF/A will destroy some fonts (oh my love Helvetica… :-) )
Abbreviated CMD (without notes) below:
[Editor’s note: this is the untested code that MONOLEMA updated. Copy and paste to pdfa.cmd to try it.]
@MONOLEMA, thanks for your efforts. I’ve pasted the code directly into the comment above.
Blog readers, please note that I have not tested MONOLEMA’s code. Please test it yourself!
As for the limit of 9, not sure, but you could try 01, 02, 03 … 10, 11, or you could switch to letters 7 8 9 A B C.
Hello Mark,
CMD only knows %0 – %9 values. The solution is SHIFT and loop and then the number of files is unlimited.
I am sending abbreviated sample functional code – only Loop and selection of PDF/A version and it recognizes when there is no *.pdf file and skips it, so it is not necessary to mark only *.pdf.
For one input PDF it creates 1 output PDF/A.
The loop can easily be applied to your original code for an unlimited number of incoming files.
(my code was already very specialized for my needs with wide options of settings for specific use, and because my English is bad and the target users are similar, I write notes and variables in Czech, so the code is probably unreadable :-) )
[Editor’s note: more untested code that MONOLEMA uploaded. I believe this will actually loop through all files in a folder, converting all to PDF/A, so be aware of that. Copy and paste to pdfa.cmd to try it!]