Examining a PDF File with qpdf

Jay Berkenbilt

Nov 6, 2022

This article explains how to examine a PDF file using qpdf. qpdf is a command-line tool and C++ library that performs content-preserving transformations on PDF files. You can learn more about qpdf from its manual or the qpdf website.

In this article, we discuss QDF mode, which has been part of qpdf since the beginning, and qpdf JSON format, which was introduced in qpdf version 11. This article shows you a couple of ways to examine the internals of a PDF file, and it lays the groundwork for a future post in which we will discuss modifying a PDF file with qpdf.

Editorial note: occasionally I will make an aside comment that goes a little deeper. When I do that, I mark it with the magnifying glass (🔍) symbol. You can safely ignore those comments without losing the flow of the material.

If you aren’t familiar with the overall structure of a PDF file, read this first: The Structure of a PDF File.

Usually when people are looking at a PDF, they are doing so with a viewer that shows the content represented by the code inside the PDF file, not the code itself. But what if you are writing software that generates PDF and want to look at the internals? What if you’re just trying to understand what’s going on in a PDF file at the code level? In that case, you may want to examine the internal code within the PDF file or even edit that code. The target audience for this type of inspection and modification is the developer who understands (or is attempting to understand) the PDF file format at a low level. If you want to learn how to inspect a PDF file at this low level using nothing more than a few command-line tools and a text editor, this article is for you.

This article assumes a fair amount of familiarity with PDF. I will assume that you are familiar with the basic structure of a PDF file but won’t assume deep knowledge of any part of the specification. I recommend my earlier blog post, The Structure of a PDF File, for basic information about PDF syntax and overall structure.

This article focuses on qpdf since I am the author of qpdf. With the release of qpdf version 11, qpdf can generate JSON files from PDF and convert those files back into PDF. We will use jq to work with qpdf JSON files. I think jq is an amazing tool that needs to be in the toolbox of anyone who likes hacking around with files.

If you wanted to apply some of the techniques in this article more systematically, you could write code in a regular programming language. If you like C++, you could use the qpdf library directly. If Python is your thing, I recommend pikepdf, which is a high-quality, well-maintained wrapper around qpdf. The pikepdf package exposes most of qpdf’s functionality and offers quite a bit of additional functionality beyond what qpdf offers directly. There are many other libraries for working with PDF files across a variety of programming languages. qpdf is very low-level. It is geared toward people who are comfortable working directly with PDF content. It does not offer higher-level abstractions for working with page contents, for example. You can find other libraries for that purpose.

Wouldn’t it be great if you could pull up a PDF file in a text editor and just start reading and editing it? Unfortunately, in most cases, you can’t really do this. There are several reasons for this:

  • Most of the “good parts” of a PDF are inside compressed binary data streams.
  • PDF files are collections of objects whose exact byte offsets in the PDF file are known. If you make an edit that changes the position of anything in the file, you need to update the cross reference table. This is, at best, tedious and error-prone, and at worst, completely impractical because the cross-reference information may itself be inside a compressed blob of binary data. While most PDF viewers are tolerant of some degree of error in the offsets, if the offsets are wrong, the file is not technically valid and may not even work.
  • The layout of PDF files is not always conducive to editing. Line endings may be inconsistent. It’s hard to find the thing you want. If you edit a stream, you have to update its length. Strings can contain arbitrary binary data. There are so many things that just make it tricky to edit a PDF file.
  • Even if you could easily edit the code that creates the contents of a page, most PDF files contain a bunch of drawing and text positioning operations that place things in exact locations on a page. Except in the simplest cases, text that appears on a PDF page is often unrecognizable in its representation in PDF code. I intend to discuss this in more depth in a future article.
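To make the offset problem concrete, here is a minimal Python sketch. The toy objects and the helper name `object_offsets` are invented for illustration, and this is not how qpdf works internally. It records the byte offset at which each object starts, which is the information a classic cross-reference table stores, and then shows what a small hand edit does to those offsets.

```python
# Hypothetical toy objects; real PDF objects would be larger.
objects = [
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
]


def object_offsets(header: bytes, objs: list[bytes]) -> list[int]:
    """Return the byte offset at which each object starts in the file."""
    offsets = []
    position = len(header)
    for obj in objs:
        offsets.append(position)
        position += len(obj)
    return offsets


header = b"%PDF-1.3\n"
before = object_offsets(header, objects)

# Simulate a hand edit: add one key to object 1, which makes it longer.
edited = [
    objects[0].replace(b"/Pages 2 0 R", b"/Pages 2 0 R /Lang (en)"),
    objects[1],
]
after = object_offsets(header, edited)

print(before)  # offsets the original cross-reference table would record
print(after)   # object 2 has moved, so the old table no longer matches
```

Every object after the edited one shifts by the number of bytes added, so a single small change invalidates every later entry in the table.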

The qpdf tool offers two solutions to this problem. Each solution has its place, and which solution you choose depends on what you’re trying to accomplish.

  • QDF mode: qpdf creates a “QDF” file, which is a completely valid PDF file that is optimized for viewing and editing in a text editor. The text editor must be able to work safely with files that contain a mixture of binary and text. I edit QDF files in emacs all the time. After editing a QDF file, a companion tool, fix-qdf, which comes with the qpdf distribution, updates stream lengths and the cross reference table. Use QDF mode if you have a file that is small enough to load into a text editor all at once (including all images, attachments, etc.) or if you want to work directly with PDF syntax. If you are an expert, this can be a quick way to make small updates to PDF files. QDF mode has been part of qpdf since its very first release.
  • qpdf JSON: Starting with qpdf version 11, qpdf is able to generate a JSON representation of a PDF file. Note that qpdf JSON is nothing more than an alternative syntax for representing the PDF file. It doesn’t do any of the hard work for text extraction, document structure, or anything else related to interpretation of the PDF file’s contents. If you don’t want to be bothered with PDF syntax or you want to work with JSON tools instead of PDF, this is a good option. JSON is much easier to edit than PDF since there are no concerns about offsets and lengths, and JSON files are pure UTF-8-encoded text. With qpdf JSON, it is also possible to work with extremely large PDF files since embedded streams can optionally be extracted to external files. If you want to iterate on editing a PDF file’s content streams or perform systematic transformations to PDF files, qpdf JSON may be a good choice. It may also be useful if you want to work with low-level PDF structure from a language that supports JSON but doesn’t have bindings to the qpdf library and you don’t want to write glue code in C or C++. The same goes if you want to script certain transformations or if you don’t have a good way to edit files that contain a mixture of text and binary.

Make sure you have a working copy of qpdf and jq. Both tools are widely available across many platforms. Linux users can install qpdf and jq using their native package manager (e.g., apt-get or yum). macOS users can install these with brew. Windows users can use chocolatey. You can also download the tools directly from their project websites.

The PDF file used in this article, sample.pdf, can be downloaded from my blog-files repository on GitHub.

The sample PDF file contains two pages and was created using LibreOffice. The LibreOffice file is also available.

For the whole story about QDF mode, I refer you to the qpdf manual. Here’s what you need to know for this article:

  • QDF files are PDF files. They completely conform to the PDF specification. They have the following additional properties (among others):
    – Most streams are uncompressed for easy viewing.
    – Within content streams, newlines are normalized.
    – Stream lengths are always indirect objects.
  • Editing a QDF file in a text editor breaks offsets and lengths, but as long as you maintain the conventions described in the manual, the fix-qdf tool will restore them.

When I open sample.pdf in emacs using a mode suitable for editing binary files, I get something that looks like this (just showing the beginning of the file here):

Even as a PDF expert, there’s not much I can do with this. Okay, I’ve seen enough of these that I recognize “x\234” as the beginning of a stream compressed with /FlateDecode, but that’s about it. (🔍Aside: there’s also something weird about the comment on line 2. PDF files are supposed to have a comment after the header containing binary data so that tools don’t mistake the file for a text file. But those bytes are valid UTF-8 encoding, which defeats the purpose. That’s probably a bug, but it doesn’t matter since there’s plenty of binary data. But never mind…I digress.)
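The “x\234” signature is easy to reproduce. The following Python sketch is illustrative only: it uses the standard zlib module rather than anything from qpdf, and the content-stream bytes are made up. It compresses some data with zlib's default settings and shows that the result begins with the bytes 0x78 0x9c, which an editor displays as “x” followed by octal 234.

```python
import zlib

# Made-up content-stream bytes, for illustration only.
raw = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 10

# Compress with zlib's default compression level.
compressed = zlib.compress(raw)

print(compressed[:2])      # b'x\x9c'
print(oct(compressed[1]))  # 0o234
print(zlib.decompress(compressed) == raw)
```

Other compression levels produce a different second byte, so this signature is a strong hint rather than a guarantee.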

We can convert this to a QDF file with a command like this:

  • qpdf --qdf sample.pdf sample-qdf.pdf

When I open the file now, the beginning of the file looks like text.

This is quite a bit better. There are a few things to notice:

  • There is a comment that starts with %QDF-1.0. This is a marker that the fix-qdf tool uses to recognize this as a QDF file.
  • The objects are “pretty-printed” for easier readability.
  • There are blank lines between objects. All objects start at the beginning of the line.
  • When qpdf rewrote the file, it renumbered the objects, but it wrote the original object ID in a comment for reference.
  • Page objects have comments indicating their page numbers so you can find them without having to traverse the pages tree manually. You can see an example on line 34 above.
  • (🔍Aside: I actually cheated a little here: the line 2 comment created by qpdf is actually 4 bytes of binary data that look like that when viewed as ISO-8859-1 encoding, but I converted them to UTF-8 in the embedded GitHub gist used for the above display for better visual effect.)

Searching ahead in the file for the comment “%% Page 1”, we can find the contents of page 1.

Here’s the entire dictionary for page 1:

Syntactically, it’s easy to read the page dictionary here. Of course, to understand it, you have to know a little more about the PDF spec, which is out of scope for this article, but if you are comfortable with PDF, this is easy to read. We can see the content stream for this page is at object 6 (see line 5). The content stream is long, but here’s an excerpt of the file with the content stream itself starting on line 8:

The visual appearance of the page is not at all evident here, but at least you can read the code. It’s a lot better than a compressed blob of binary data. Notice a few things:

  • The comment indicates that this is part of the contents of page 1 (line 1)
  • The length (line 5) is an indirect object
  • The actual content stream is readable. PDF content operators are terse and are primarily intended to be machine-readable, but if you understand them, they are readable. In particular the q and Q operators isolate changes to graphics state, and the blocks starting with BT and ending with ET are text blocks. You can’t actually read the text — more on that in a future article. (🔍In a nutshell, the characters in the text strings are indices into a subsetted font containing glyphs only for the characters that appear. They do not represent ASCII or Unicode code points. You need a lookup table to know what actual character each value corresponds to, and that lookup table is in a different part of the file and can even be omitted. Don’t worry — I’ll explain it better in a future article.)

In spite of these improvements, editing this file in a text editor is still challenging. You have to make sure the mix of text and binary is not going to be corrupted by the editor. You don’t want an editor that’s going to try to treat this like UTF-8-encoded text. That will break your PDF. (I have seen quite a few bug reports about broken PDFs that I can tell from inspection have been passed through some attempt at UTF-8-encoding of binary data!) Not to mention that the repeated process of editing the file, running fix-qdf, and reloading is a bit cumbersome. But it is possible, and I have hand-repaired many broken PDF files using this approach!

While it would be possible to continue along this path, I’m going to switch the focus of this article to working with the JSON representation of PDF files. Please consult the qpdf manual for additional notes on working with QDF files.

The qpdf manual has a chapter with a lot of information about qpdf JSON. I encourage you to read it if you need more depth than is provided by this article. Here I’ll cover some basics and walk through a few neat tricks.

Here’s the key point about qpdf JSON. JSON is a format that represents essentially the same types of data structures as the PDF object format. This makes it a great choice as a syntactic alternative for PDF. To be clear, the PDF file format is designed for efficient machine parsing, random access, and other stuff that JSON isn’t good for, but for human consumption or manipulation by people or code, JSON is a good choice. It’s easy to read and trivial to parse compared to PDF, and JSON tools are available in every mainstream programming language and even for the shell. Note that the only tool that understands qpdf JSON format semantically is qpdf. While the JSON file is semantically equivalent to a PDF file, you can’t open it up in a PDF viewer. I just wanted to make sure that was clear.

The PDF object structure and JSON both support strings, numbers, booleans, nulls, arrays, and “dictionaries,” though in JSON they’re called objects. Sadly, this creates a bit of confusion. In JSON, an “object” refers to a map of keys to values. In PDF, everything is an object, and a map is called a dictionary. Since this article is PDF focused, I will continue to refer to JSON objects as “dictionaries” for consistency with the PDF terminology.

The similarity in the kinds of data that PDF and JSON represent makes for a simple mapping between PDF objects and JSON:

  • PDF numbers, booleans, nulls, arrays, and dictionaries are represented as JSON numbers, booleans, nulls, arrays, and dictionaries (objects).
  • PDF strings, indirect object references, and names are all represented as strings in JSON with conventions for disambiguation.
  • Top-level numbered objects are represented as dictionaries. Non-stream objects are dictionaries containing a “value” key that points to the JSON representation of the PDF object. Stream objects are dictionaries with a “stream” key whose “dict” key is the stream dictionary. The stream’s binary data may be omitted, included as base64 in the “stream” object’s “data” key, or written to a file, in which case the path to the file is in the “datafile” key.

To disambiguate the different kinds of PDF objects that are represented as strings in JSON, we have the following rules:

  • Indirect object references are JSON strings of the form “o g R” where “o” and “g” are the object and generation numbers (e.g., “10 0 R”). This matches the PDF syntax for indirect object references.
  • Names are represented as strings that start with the slash character, e.g., “/Type”. There are some nuances about name canonicalization that I will omit, but the details are spelled out in the qpdf manual.
  • PDF strings that can be represented as UTF-8-encoded text, including all ASCII strings, have the form “u:utf-8-encoded string value”. For example, the π character would be represented as “u:π”. This is a little easier to read than <FEFF03C0>, which would be what you’d probably see in a PDF file.
  • PDF strings that are not representable as UTF-8 text have the form “b:nnnnnn…” where nnnnnn… is a hexadecimal representation of the bytes. So the three-byte string of binary characters 0x89, 0xab, 0xcd would be represented as “b:89abcd”.
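The rules above can be sketched as a small Python classifier. This is a hypothetical helper written for this article's examples, not part of qpdf or pikepdf, and it ignores the name canonicalization nuances mentioned above.

```python
import re

# An indirect object reference looks like "10 0 R".
INDIRECT_REF = re.compile(r"(\d+) (\d+) R")


def classify(value: str):
    """Classify a JSON string from qpdf JSON using the conventions above."""
    if value.startswith("u:"):
        return ("string", value[2:])
    if value.startswith("b:"):
        return ("binary", bytes.fromhex(value[2:]))
    if value.startswith("/"):
        return ("name", value)
    match = INDIRECT_REF.fullmatch(value)
    if match:
        return ("reference", (int(match.group(1)), int(match.group(2))))
    raise ValueError(f"unrecognized qpdf JSON string: {value!r}")


for sample in ["10 0 R", "/Type", "u:π", "b:89abcd"]:
    print(classify(sample))
```

Because every PDF string carries a “u:” or “b:” prefix, a string whose text happens to look like a name or a reference can't be mistaken for one.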

As for the top-level structure, qpdf JSON has no need for a cross-reference table. Instead, the qpdf JSON representation has a top-level “dictionary” with the key “qpdf”, which points to an array of two elements. The first element is a dictionary with various metadata including the information that would appear in the PDF header. The second element is a dictionary containing all the objects in the file. The trailer dictionary’s key is “trailer”. All other objects have a key of the form “obj:o g R” where “o” and “g” again represent the object and generation number. This syntax makes it easy to search for the definition of an object that is referenced elsewhere in the file.

An example is worth 1,000 paragraphs, so here’s an example. Here is the minimal PDF file from the PDF structure blog post:

You can convert this to JSON using all the qpdf defaults with the command

  • qpdf --json-output minimal.pdf minimal.json

The result is the following JSON:

It should be fairly clear how to map from one to the other. Mostly I’ll leave the mapping as “an exercise for the reader,” but here are a few things to point out:

  • The PDF version is represented as a comment in the first line of the PDF file. In the JSON file, it appears as the “pdfversion” key of the first element of the “qpdf” array.
  • This minimal example has no UTF-8 strings, so you don’t see any examples of that. However, the “/ID” from the trailer is a binary string.
  • Unlike with QDF mode, the JSON representation of the file does not go through any of qpdf’s PDF rewriting logic. As such, objects are not renumbered, so the original PDF objects and the JSON objects have the same object numbers.
  • Even though the stream data in the minimal example is ASCII, stream data in this default mode always appears base64-encoded in qpdf JSON. When saved to a file, stream data is not base64-encoded.
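Because references and object keys share the “o g R” form, following a reference is just a dictionary lookup. Here is a Python sketch using an abridged, hand-written document in the shape described above. The object contents are invented, the first element shows only the “pdfversion” key, and the sketch assumes the trailer is wrapped in a “value” key like other non-stream objects.

```python
import json

# Abridged, hand-written document; not actual qpdf output.
text = """
{
  "qpdf": [
    {"pdfversion": "1.3"},
    {
      "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
      "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
      "trailer": {"value": {"/Root": "1 0 R", "/Size": 3}}
    }
  ]
}
"""

doc = json.loads(text)
metadata, objects = doc["qpdf"]


def resolve(reference: str) -> dict:
    """Look up the object that an "o g R" reference string points to."""
    return objects["obj:" + reference]


# Walk from the trailer to the catalog to the pages tree root.
root_ref = objects["trailer"]["value"]["/Root"]
catalog = resolve(root_ref)["value"]
pages = resolve(catalog["/Pages"])["value"]

print(metadata["pdfversion"])
print(root_ref, "->", catalog["/Type"])
print(pages["/Count"])
```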

Now let’s take a look at a few objects from the sample.pdf used in the QDF mode part of this article.

Here is object 2 from the QDF file:

Note the following:

  • This is object 2 now, but we can see from the comment that it was originally object 130.
  • The /Creator and /Producer keys point to hex-encoded strings. In PDF, a Unicode string can be encoded as UTF-16 by starting the string with the character U+FEFF (the byte order mark, a zero-width no-break space that is therefore unlikely to appear at the beginning of real text), followed by the rest of the string encoded as big-endian UTF-16. If you have Unicode hexadecimal codes memorized, you can read this, but otherwise, it’s probably not terribly useful. (🔍The leading feff marks this as Unicode, and the fact that every other pair of digits is 00 is a giveaway that this is UTF-16-encoded ASCII.)
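If you don't have the hexadecimal codes memorized, a few lines of Python can do the decoding. This is a minimal sketch with a hypothetical helper name. It only handles hex strings that start with the FEFF marker and does not attempt other PDF string encodings.

```python
import codecs


def decode_pdf_hex_string(hex_digits: str) -> str:
    """Decode the hex digits of a PDF hex string holding UTF-16BE text."""
    raw = bytes.fromhex(hex_digits)
    if not raw.startswith(codecs.BOM_UTF16_BE):
        raise ValueError("not a UTF-16BE string with a leading FEFF marker")
    # Drop the two marker bytes, then decode the rest as big-endian UTF-16.
    return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")


print(decode_pdf_hex_string("FEFF03C0"))              # π
print(decode_pdf_hex_string("feff0071007000640066"))  # qpdf
```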

Let’s take a look at the qpdf JSON version of this. I’ll need object 130. I could convert the whole file to JSON, but I can also just grab object 130 like this:

  • qpdf --json-output sample.pdf --json-object=130 out.json

This results in the following JSON file:

Here we can see object 130 as the dictionary with key “obj:130 0 R” in the second element of the “qpdf” array. Since it’s not a stream, there’s just a “value” key. It has the same keys as the PDF dictionary, but we can read the strings. Since they are Unicode strings (as all ASCII strings are), they appear as regular JSON strings prepended by “u:”. This is a lot easier to read.

Let’s use qpdf JSON mode to look at the contents of the first page of sample.pdf. To generate the JSON for this section, I used the following command:

  • qpdf --json-output sample.pdf --json-stream-data=file --json-key=pages --decode-level=generalized out/sample.json

The output is too large to include in its entirety here, but you can view or download the file from the git repository.

First, let’s unpack the command. The --json-output flag tells qpdf to write the output as JSON instead of PDF. The --json-stream-data=file option causes the data for each stream to be written to a file instead of included inline as base64. The --json-key=pages option causes additional metadata about the file’s pages to be included in addition to the objects in the file. Like the special comments in QDF mode, the pages metadata makes it easier to find information about a specific page without traversing the pages tree. The --decode-level=generalized option tells qpdf to uncompress all streams that are compressed with a generalized compression algorithm. This enables us to read the content streams in any text editor.

The output file out/sample.json will contain the complete JSON output. Each stream’s data will be written to a file named out/sample.json-n, where n is the object number of the stream. In this case, the stream data is written uncompressed unless the stream was originally compressed with a specialized algorithm (such as JPEG compression).

Let’s look at just the pages part of the JSON file. This was generated with

  • jq .pages < out/sample.json

There are a lot of things we could learn from this, but I will refer you to the qpdf manual for details about what everything means. For our purposes, we are interested in the content stream. You can see on line 4 that the contents of page 1 are in object 2. Let’s look at object 2.

  • jq -r '.qpdf[1]."obj:2 0 R"' < out/sample.json

(Note: there are four quote characters in the above command: a single quote after jq -r, a double quote before obj:, and a double quote followed by a single quote after R.)

This results in the following:

Simple enough. This stream has an empty dictionary. qpdf JSON files omit the /Length key since it is not useful here, and this removes any suggestion that the user has to keep it accurate. Rather than a base64-encoded stream in “data”, we have a file name in “datafile”. We can view this file, and we can also modify it and put its contents back into the PDF.
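If you process qpdf JSON from code, you need to handle both places the stream data can live. Here is a minimal Python sketch with a hypothetical helper, `read_stream_data`, that follows the “data” and “datafile” conventions described earlier. The stream contents and file name are invented, and the sketch assumes the “datafile” path can be opened as-is from the current working directory.

```python
import base64
import os
import tempfile


def read_stream_data(obj: dict):
    """Return the bytes of a qpdf JSON stream object, or None if omitted."""
    stream = obj["stream"]
    if "data" in stream:
        # Inline stream data is base64-encoded.
        return base64.b64decode(stream["data"])
    if "datafile" in stream:
        # Stream data saved to a file is stored as raw bytes.
        with open(stream["datafile"], "rb") as handle:
            return handle.read()
    return None


# Hand-made examples; the content bytes and file name are invented.
content = b"q\nBT\nET\nQ\n"
inline_obj = {
    "stream": {"dict": {}, "data": base64.b64encode(content).decode("ascii")}
}

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "sample.json-2")
with open(path, "wb") as handle:
    handle.write(content)
file_obj = {"stream": {"dict": {}, "datafile": path}}

print(read_stream_data(inline_obj) == read_stream_data(file_obj))
```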

Here are the first few lines of out/sample.json-2:

This is the same as we saw back in the QDF mode example.

One of the things you can do with qpdf JSON format is to answer certain questions about the PDF that are otherwise hard to answer. For example, let’s say you wanted to get the sizes of all the streams in a file. With the output that we generated using --json-stream-data=file, this becomes trivial — just look at the sizes of the stream data. But you could also do it if the streams were inlined. Try this:

What’s all that? Let’s take it apart line by line.

  1. Generate qpdf JSON to standard output
  2. run jq -r to output results as text rather than JSON
  3. grab the objects dictionary
  4. convert from a dictionary to an array of entries
  5. for each entry…
  6. keep only the ones that are streams
  7. concatenate the key (object ID) with the length of the base64-decoded data field.

That generates the following output:

Okay, I’ll admit that’s a bit involved, but you have to admit it’s pretty powerful. If you’re in a position where you have to analyze PDF files, this can be a real help. In my day job, I sometimes have to answer questions like, “What is the largest content stream or image in this 100,000 page PDF file?” qpdf JSON and jq provide a pretty handy way to answer those kinds of questions with relatively little effort compared to writing a special-purpose program to gather the information.
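If you prefer a general-purpose language to jq, the same steps could be sketched in Python. This is an illustrative equivalent rather than a translation of the exact jq program. The function name `stream_sizes` is made up, and the input below is a hand-written stand-in for qpdf JSON with inline stream data.

```python
import base64
import json


def stream_sizes(qpdf_json_text: str) -> dict:
    """Map each stream's object key to the size of its decoded data."""
    # Grab the objects dictionary (second element of the "qpdf" array).
    objects = json.loads(qpdf_json_text)["qpdf"][1]
    sizes = {}
    for key, obj in objects.items():
        # Keep only the entries that are streams with inline data.
        stream = obj.get("stream")
        if stream is not None and "data" in stream:
            sizes[key] = len(base64.b64decode(stream["data"]))
    return sizes


# Hand-written stand-in for qpdf JSON output; contents are invented.
example = json.dumps({
    "qpdf": [
        {"pdfversion": "1.3"},
        {
            "obj:1 0 R": {"value": {"/Type": "/Catalog"}},
            "obj:4 0 R": {
                "stream": {
                    "dict": {},
                    "data": base64.b64encode(b"BT ET").decode("ascii"),
                }
            },
            "trailer": {"value": {"/Root": "1 0 R"}},
        },
    ]
})

for key, size in stream_sizes(example).items():
    print(f"{key}: {size}")
```

Decoding to bytes before measuring gives the size in bytes, which is what you want for binary streams such as images.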

A significant capability of qpdf JSON mode is the ability to go back from JSON to PDF. This makes it possible to make modifications in JSON or to modify stream contents in stand-alone files and to then reconstruct the PDF file. This can be a lot easier to work with than editing QDF files in an editor. The JSON file used to reconstruct a PDF can be a stand-alone JSON file containing the complete contents of the PDF, or it can just contain objects that you want to modify. This makes it possible to make small changes to very large PDF files without having to ever load the whole file into memory.

In future articles, I will discuss how to use qpdf JSON to modify a PDF file, and I will use this sample file along with some modifications to the content stream to explain how to interpret the text blocks that appear in the file.

I hope you found this article to be informative. If there are other PDF-related topics you’d like to read about, please let me know in the comments.
