A PDF is not a single thing. It is a binary container with a specific structure on disk, a set of content layers stacked inside that container, and a family of ISO standards built on top of the base specification. Open one in a hex editor and the structure is visible in plain text on the first few lines. Open one in a viewer and the layers render together as a single page.

This guide is a labeled reference: the physical file structure, the content layers a body holds, the metadata that surrounds everything, and the standards (PDF/A, PDF/X, PDF/UA, PDF/E, PDF/VT) that constrain it for specific use cases.

Anatomy of a PDF file: header, body objects, cross-reference table, trailer, content layers, and metadata

The Four Physical Sections

Every PDF on disk has the same four-part structure, in this order:

1. Header

The first line of the file. Always starts with %PDF- followed by a version number:

%PDF-1.7

Versions range from 1.0 (released 1993) through 2.0 (first published 2017, current). The header is usually followed by a comment line containing at least four bytes with values of 128 or higher, which signals to FTP clients and other transport tools that the file is binary rather than text.
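As a minimal sketch of the header check described above, the following Python reads the version from the first bytes of a file. The function name pdf_version and the sample bytes are illustrative assumptions, not part of any PDF library.

```python
import re

# Matches the "%PDF-M.N" marker at the very start of the file.
HEADER_RE = re.compile(rb"^%PDF-(\d)\.(\d)")

def pdf_version(data: bytes) -> tuple[int, int]:
    """Return (major, minor) from the header line, or raise ValueError."""
    match = HEADER_RE.match(data)
    if match is None:
        raise ValueError("not a PDF: missing %PDF- header")
    return int(match.group(1)), int(match.group(2))

# Hypothetical first bytes of a file: header, binary comment, first object.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(pdf_version(sample))  # (1, 7)
```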

2. Body - Indirect Objects

The bulk of the file. Every page, font, image, annotation, and form field is a numbered indirect object:

1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
 
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
 
3 0 obj
<< /Type /Page /Parent 2 0 R /Contents 4 0 R /Resources << ... >> >>
endobj

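Each object has an object number (the number before 0 obj), a generation number (0 for a new object; it is incremented only when an object number is reused after the original object is deleted), and a payload. The payload is most often a dictionary between << and >>, optionally followed by binary data between stream and endstream (image data, font data, compressed content).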

Objects reference each other with the <id> <gen> R syntax (e.g., 3 0 R means "object 3, generation 0"). This is how a page references the font it uses, or how a catalog references the root of the page tree.
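To make the reference syntax concrete, here is a rough Python sketch that scans a fragment for indirect objects and lists which object numbers each one points to. It is a toy under stated assumptions: a regex scan is not a real PDF tokenizer (stream data and string literals would confuse it), and the names reference_graph and SAMPLE are made up for illustration.

```python
import re

# "<num> <gen> obj ... endobj" with the body captured lazily.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# "<num> <gen> R" indirect reference.
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

def reference_graph(data: bytes) -> dict[int, list[int]]:
    """Map each object number to the object numbers it references."""
    graph = {}
    for obj in OBJ_RE.finditer(data):
        obj_num = int(obj.group(1))
        graph[obj_num] = [int(ref.group(1)) for ref in REF_RE.finditer(obj.group(3))]
    return graph

SAMPLE = b"""1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Contents 4 0 R >>
endobj
"""

print(reference_graph(SAMPLE))  # {1: [2], 2: [3], 3: [2, 4]}
```

The output shows the catalog pointing at the page tree, the page tree at its one page, and the page pointing back at its parent and on to its content stream.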

3. Cross-Reference Table (xref)

A byte-offset lookup table. For every object in the body, the xref records its absolute byte position in the file:

xref
0 6
0000000000 65535 f
0000000017 00000 n
0000000089 00000 n
0000000172 00000 n
0000000299 00000 n
0000000453 00000 n

This is what makes PDFs random-access. A viewer can read the xref, jump straight to the byte offset of object 3, and parse just that object and the ones it references, without reading the rest of the file. It is why a viewer can open a 500-page file and jump to a single page almost instantly.
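The lookup can be sketched in a few lines of Python. This assumes a classic text xref section like the one shown above (not a compressed cross-reference stream) and that the input contains only the xref section itself; parse_xref_section is a hypothetical helper name.

```python
def parse_xref_section(text: str) -> dict[int, tuple[int, int, bool]]:
    """Map object number -> (byte offset, generation, in_use)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the 'xref' keyword")
    table = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number and entry count.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        for obj_num in range(first, first + count):
            offset, generation, kind = lines[i].split()
            table[obj_num] = (int(offset), int(generation), kind == "n")
            i += 1
    return table

XREF = """xref
0 6
0000000000 65535 f
0000000017 00000 n
0000000089 00000 n
0000000172 00000 n
0000000299 00000 n
0000000453 00000 n
"""

table = parse_xref_section(XREF)
print(table[3])  # (172, 0, True)
```

With the table in hand, a reader would seek to byte 172 and start parsing object 3 there. Entry 0 is the head of the free list, which is why it is marked f with generation 65535.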

4. Trailer

The last section. Tells the parser where to find the xref and which object is the root:

trailer
<< /Size 6 /Root 1 0 R /Info 5 0 R >>
startxref
1893
%%EOF

The startxref value is the byte offset of the xref table, counted from the start of the file. The %%EOF marker is the literal end of the file. Trailers are what make incremental updates possible: appending the changed objects plus a new xref section and trailer (whose /Prev entry points back to the previous xref) lets you add or replace objects without rewriting the whole file.
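A parser therefore starts at the end of the file, not the beginning. The sketch below searches backwards for the last startxref before the final %%EOF. The 1024-byte window and the function name are assumptions, and the byte offsets in the sample data are illustrative values, not real positions.

```python
def find_startxref(data: bytes, window: int = 1024) -> int:
    """Return the xref byte offset named by the last startxref in the file."""
    tail = data[-window:]
    eof = tail.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no %%EOF marker near the end of the file")
    keyword = tail.rfind(b"startxref", 0, eof)
    if keyword == -1:
        raise ValueError("no startxref keyword before %%EOF")
    tokens = tail[keyword + len(b"startxref"):eof].split()
    if not tokens:
        raise ValueError("startxref has no offset")
    return int(tokens[0])

ORIGINAL = b"%PDF-1.7\n...\ntrailer\n<< /Size 6 /Root 1 0 R >>\nstartxref\n1893\n%%EOF\n"
# An incremental update appends a second xref + trailer after the first %%EOF.
UPDATED = ORIGINAL + b"...\ntrailer\n<< /Size 7 /Root 1 0 R /Prev 1893 >>\nstartxref\n2410\n%%EOF\n"

print(find_startxref(ORIGINAL))  # 1893
print(find_startxref(UPDATED))   # 2410
```

Because the search runs from the end, an incrementally updated file resolves to its newest xref first, and older sections are reached by following /Prev.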

The Six Content Layers

Inside the body, content is stored across six layer types. Every rendered PDF page is a composite of these layers:

1. Text

Glyph positioning commands and font references, not plain text strings. A PDF stores instructions like "draw glyph 42 from font F3 at position (120, 540)" rather than "draw the letter A here." Text is selectable and searchable only because the viewer maps those character codes back to Unicode code points, using the font's ToUnicode CMap where one is present and falling back to the font's encoding otherwise.

When text is missing a ToUnicode mapping, you get the classic "PDF with selectable text that copies as garbage" problem. The text is visible, but the glyph-to-Unicode mapping is broken or absent.
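The following Python sketch shows the idea with a simplified ToUnicode CMap. It handles only beginbfchar blocks (real CMaps also use beginbfrange), and the helper names and sample codes are assumptions for illustration. Codes with no mapping come out as U+FFFD, which is the "copies as garbage" case.

```python
import re

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Map character codes to Unicode strings (destinations are UTF-16BE hex)."""
    mapping = {}
    for block in BFCHAR_BLOCK.finditer(cmap_text):
        for src, dst in HEX_PAIR.findall(block.group(1)):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def codes_to_text(codes: list[int], mapping: dict[int, str]) -> str:
    """Unmapped codes become U+FFFD, the replacement character."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)

CMAP = """
begincmap
2 beginbfchar
<0024> <0041>
<0025> <0042>
endbfchar
endcmap
"""

mapping = parse_bfchar(CMAP)
print(codes_to_text([0x24, 0x25], mapping))              # AB
print(codes_to_text([0x24, 0x26], mapping) == "A\ufffd")  # True
```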

2. Images

Stored as embedded streams in one of several formats:

JPEG (DCTDecode filter): photographs, most common
JPEG2000 (JPXDecode): higher compression, less common
PNG-equivalent (FlateDecode + Predictor): screenshots, line art
CCITT Group 4 (CCITTFaxDecode): black-and-white scanned text, used in archival scans
JBIG2 (JBIG2Decode): bilevel images, common in OCR'd documents

Images can be downsampled, recompressed, or replaced without affecting other content.

3. Fonts

Embedded as full font programs, embedded as subsets (only the glyphs actually used), or referenced by name (the font must then be installed on the viewer's system). Most PDF generators subset by default because it cuts file size dramatically. Supported font formats: Type 1, TrueType, OpenType, Type 3, and CID-keyed fonts (used for CJK and other large character sets).

When a font is referenced but not embedded and not installed on the viewer's system, the viewer substitutes a similar font - which usually looks wrong. PDF/A requires all fonts be embedded to prevent this.

4. Annotations

Highlights, comments, links, stamps, watermarks, and form fields are all annotations. They are layered over the page content and can be added, edited, or removed without changing the underlying page.

Form fields are a special case: an interactive widget annotation (the visible part) plus a field dictionary (the data part). When you fill out a form and save, only the field dictionaries change - the page itself is untouched.

5. Vector Graphics

Lines, shapes, curves, and paths drawn with PostScript-like path operators (m, l, and c, the counterparts of PostScript's moveto, lineto, and curveto). They scale without quality loss. Most CAD exports, charts, and diagrams in PDFs are vector graphics.
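In a content stream these operators are written in postfix form: operands first, then a one-letter operator (m for moveto, l for lineto, h to close the path, S to stroke, f to fill). The sketch below emits the operators for a closed polygon. The function names are hypothetical, and the output is only the path fragment, not a complete page.

```python
def fmt(value: float) -> str:
    """Format a number the way PDF expects: plain decimal, no exponent."""
    return f"{value:.3f}".rstrip("0").rstrip(".")

def polygon_path(points: list[tuple[float, float]], stroke: bool = True) -> str:
    """Return content-stream operators that draw a closed polygon."""
    (x0, y0), *rest = points
    ops = [f"{fmt(x0)} {fmt(y0)} m"]                      # start a subpath
    ops += [f"{fmt(x)} {fmt(y)} l" for x, y in rest]      # straight segments
    ops.append("h")                                       # close the subpath
    ops.append("S" if stroke else "f")                    # stroke or fill it
    return "\n".join(ops)

print(polygon_path([(100, 100), (200, 100), (150, 180)]))
# 100 100 m
# 200 100 l
# 150 180 l
# h
# S
```

Nothing in this description depends on resolution, which is why the shape stays sharp at any zoom level.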

6. Digital Signatures

PKI-backed signatures tied to byte ranges of the file. The signature dictionary's /ByteRange entry specifies, in effect, "bytes 0 through 12,547 and 14,200 through end-of-file are signed" - the gap in the middle is reserved for the signature value itself. Any change to the signed byte ranges invalidates the signature, which is how a validating viewer detects tampering after signing.

Some PDFs have multiple signatures, layered as incremental updates - each signer signs the file as it existed when they received it, preserving the chain.
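The byte-range mechanism can be illustrated with a short Python sketch. /ByteRange is an array of (offset, length) pairs; the code concatenates those spans and hashes them. This shows only the hashing step: real validation also verifies the CMS signature stored in the gap and the signer's certificate chain. The sample data and function names are made up for illustration.

```python
import hashlib

def signed_bytes(data: bytes, byte_range: list[int]) -> bytes:
    """Concatenate the spans named by a /ByteRange array of offset/length pairs."""
    if len(byte_range) % 2:
        raise ValueError("ByteRange must hold offset/length pairs")
    pairs = zip(byte_range[::2], byte_range[1::2])
    return b"".join(data[offset:offset + length] for offset, length in pairs)

def signed_digest(data: bytes, byte_range: list[int]) -> str:
    return hashlib.sha256(signed_bytes(data, byte_range)).hexdigest()

# Toy file: 4 signed bytes, an 11-byte gap for the signature value, 4 signed bytes.
doc = b"HEAD" + b"<SIGNATURE>" + b"TAIL"
byte_range = [0, 4, 15, 4]

original = signed_digest(doc, byte_range)
gap_changed = signed_digest(b"HEAD" + b"<XXXXXXXXX>" + b"TAIL", byte_range)
tampered = signed_digest(b"HEAD" + b"<SIGNATURE>" + b"TAIl", byte_range)

print(original == gap_changed)  # True: the gap is not part of the signed data
print(original == tampered)     # False: a signed byte changed
```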

Metadata: Two Parallel Systems

PDF has two metadata systems that often disagree:

Standard /Info Dictionary

Referenced from the trailer's /Info entry. Fields: Title, Author, Subject, Keywords, Creator (the app the user created the document in), Producer (the app that generated the PDF), CreationDate, ModDate. Plain text strings, easy to read with any PDF tool.
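CreationDate and ModDate are plain strings too, in the form D:YYYYMMDDHHmmSS followed by an optional timezone (Z, or an offset such as +02'00'). Everything after the year is optional. The parser below is a sketch under those assumptions; parse_pdf_date is a hypothetical name, and nonstandard variants that some producers write are not handled.

```python
import re
from datetime import datetime, timedelta, timezone

DATE_RE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)

def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string; the result is naive if no timezone is given."""
    match = DATE_RE.match(value)
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    defaults = (0, 1, 1, 0, 0, 0)  # missing month/day default to 1, times to 0
    parts = [int(g) if g is not None else d
             for g, d in zip(match.groups()[:6], defaults)]
    utc, sign, tz_hours, tz_minutes = match.groups()[6:]
    tz = None
    if utc:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_hours), minutes=int(tz_minutes or 0))
        tz = timezone(-offset if sign == "-" else offset)
    return datetime(*parts, tzinfo=tz)

print(parse_pdf_date("D:20240315142530+02'00'").isoformat())
# 2024-03-15T14:25:30+02:00
```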

XMP Metadata Stream

A separate XML stream (Adobe XMP, based on RDF/XML) that supports richer schemas: Dublin Core, IPTC, custom domain-specific schemas (color profiles, copyright registrations, manuscript versioning).

Modern PDF generators write to both. Old PDFs only have /Info. Some PDFs have stale /Info from a previous version and accurate XMP from a recent edit - or vice versa. When auditing PDFs for compliance or forensics, check both.
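A "check both" audit can be sketched with Python's standard XML parser. This assumes the /Info fields and the XMP packet have already been extracted from the file by some other tool, and it compares only two fields (Title against dc:title, Producer against pdf:Producer). The sample values and function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}

def xmp_fields(xmp_xml: str) -> dict[str, str]:
    """Pull the XMP counterparts of /Info Title and Producer, if present."""
    root = ET.fromstring(xmp_xml)
    found = {}
    title = root.find(".//dc:title/rdf:Alt/rdf:li", NS)
    if title is not None and title.text:
        found["Title"] = title.text
    producer = root.find(".//pdf:Producer", NS)
    if producer is not None and producer.text:
        found["Producer"] = producer.text
    return found

def metadata_conflicts(info: dict[str, str], xmp: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return fields present in both systems whose values differ."""
    return {key: (info[key], xmp[key])
            for key in info.keys() & xmp.keys() if info[key] != xmp[key]}

XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Q3 Report</rdf:li></rdf:Alt></dc:title>
   <pdf:Producer>ExampleLib 2.0</pdf:Producer>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""

INFO = {"Title": "Q2 Report", "Producer": "ExampleLib 2.0"}

print(metadata_conflicts(INFO, xmp_fields(XMP)))
# {'Title': ('Q2 Report', 'Q3 Report')}
```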

ISO Standards Built on PDF

The base PDF specification is ISO 32000. Several derived standards constrain PDF for specific use cases:

Standard	Use	Constraints
PDF/A	Long-term archival	All fonts embedded, no JavaScript, no audio/video, color spaces device-independent. Parts: PDF/A-1, A-2, A-3 (allows arbitrary file attachments); conformance levels within a part: b (basic), u (Unicode text), a (accessible)
PDF/X	Print production	CMYK color, embedded fonts and color profiles, no transparency (PDF/X-1a) or controlled transparency (PDF/X-4)
PDF/UA	Accessibility	Tagged structure tree, language metadata, alt text for images, logical reading order
PDF/E	Engineering	3D models (U3D, PRC formats), CAD-specific metadata
PDF/VT	Variable transactional printing	Optimized for high-volume personalized mailings

A PDF can comply with multiple standards simultaneously - PDF/A-2u (archival with Unicode mapping) plus PDF/UA (accessibility) is common for government and legal archives.

Linearized PDFs (Web-Optimized)

A "linearized" or "web-optimized" PDF reorders the body so the first page's objects appear early in the file. A web viewer can render page 1 after downloading only the beginning of the file (on the order of the first ~50 KB for a simple page) instead of waiting for the whole thing. To make this work, the file starts with a linearization dictionary and a first-page xref and trailer, followed by a hint table that tells the viewer where each page's objects start.

Most modern PDF generators support linearization as a "Save for Web" option. The format adds 2-5% to file size in exchange for fast first-page rendering over slow connections.
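One way to check for linearization is to look for the linearization dictionary near the start of the file and compare its /L entry, which records the total file length, against the actual size. A mismatch usually means the file was incrementally updated after it was linearized. The sketch below is a heuristic under those assumptions, not a full validator, and the function name and sample bytes are hypothetical.

```python
import re

LINEARIZED_RE = re.compile(rb"/Linearized\s+[\d.]+")
LENGTH_RE = re.compile(rb"/L\s+(\d+)")

def linearization_status(data: bytes) -> str:
    """Classify a file by inspecting its first 1024 bytes."""
    head = data[:1024]
    if not LINEARIZED_RE.search(head):
        return "not linearized"
    length = LENGTH_RE.search(head)
    if length and int(length.group(1)) == len(data):
        return "linearized"
    return "linearized, but modified since"

# Toy file padded so its real length matches the /L value of 60.
base = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 60 >>\nendobj\n"
fast = base.ljust(60, b"\n")

print(linearization_status(fast))                 # linearized
print(linearization_status(fast + b"% update"))   # linearized, but modified since
print(linearization_status(b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n"))  # not linearized
```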

Encryption and Permissions

PDFs can be encrypted with a password (or certificates) and granted granular permissions: print, copy text, modify, fill forms, extract for accessibility. The encryption is stored in the trailer's /Encrypt dictionary.

Encryption strengths have evolved: RC4 40-bit (early PDFs, trivially cracked today), RC4 128-bit (still weak), AES-128, AES-256. The original Acrobat 5 RC4 implementation was cracked publicly in 2001; modern PDF encryption (AES-256, PDF 2.0) is sound when used with strong passwords.

Note: "permissions" are advisory. They are flags that a cooperating viewer reads and chooses to enforce; a viewer that ignores them, or a tool that strips the encryption, is not bound by them.
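The permissions themselves are bit flags packed into the /P entry of the /Encrypt dictionary, stored as a signed 32-bit integer. The sketch below decodes the commonly used bits. The dictionary keys are descriptive labels chosen for this example, and the sample value -3884 is made up.

```python
# Bit positions are 1-based, counted from the least significant bit.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}

def decode_permissions(p: int) -> dict[str, bool]:
    """Decode a /P value into named permission flags."""
    flags = p & 0xFFFFFFFF  # reinterpret the signed value as unsigned 32-bit
    return {name: bool(flags & (1 << (bit - 1)))
            for name, bit in PERMISSION_BITS.items()}

granted = [name for name, allowed in decode_permissions(-3884).items() if allowed]
print(granted)  # ['print', 'copy']
```

Nothing in these bits prevents a tool from reading the content once the file is decrypted, which is why the flags only bind viewers that choose to honor them.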

How PDFSub Reads PDFs

PDFSub processes PDFs using a Rust binding to PDFium (the same engine that powers Chromium's PDF viewer) plus PaddleOCR for scanned documents. For full architecture details and a comparison with cloud-based tools, see Browser vs Cloud PDF Security.

For converting PDFs to other formats while preserving the structure described above:

PDF to Excel - extracts text + tables, preserves coordinates
OCR PDF - adds a searchable text layer to scanned PDFs
PDF to Word - reflows text into editable paragraphs
Compress PDF - downsamples images, subsets fonts

For archival workflows specifically, see How to Convert PDF to PDF/A.

Reading Further

ISO 32000-2 (PDF 2.0 spec) - authoritative reference, paywalled
Adobe PDF Reference Archives - free reference for PDF 1.7
PDF Association - industry working group, free articles and conformance test files

For PDF-specific topics: PDF Compliance Guide for Lawyers, PDF/A Conversion Guide.