How to Extract Tables from PDF to Excel: 5 Methods Compared

You have a PDF with a table you need in Excel. Maybe it's a financial report, a bank statement, an invoice, or a research paper. The data is right there - neatly organized in rows and columns on the screen. But when you try to get it out, everything falls apart.

This happens because PDF isn't a data format. It's a display format. There's no concept of a "table," "row," or "column" in the PDF specification. What looks like a structured table is actually dozens of text fragments placed at specific x,y coordinates on a canvas. Extracting that structure back into a spreadsheet is a reverse-engineering problem - and different tools handle it with varying degrees of success.

This guide covers 5 methods for extracting tables from PDFs, when each one works best, and what to do when things go wrong.

Why Table Extraction from PDFs Is Hard

The PDF Format Has No Tables

The PDF specification (ISO 32000-2:2020) defines a content stream - a sequence of operators that position individual characters at precise coordinates. A simple table row like "Date | Description | Amount" might be stored as:

BT /F1 10 Tf 72 650 Td (01/15/2026) Tj 200 0 Td (Office Supplies) Tj 180 0 Td (125.00) Tj ET

There are no <table>, <tr>, or <td> tags. No row identifiers. No column boundaries. The visual lines you see around cells are separate drawing operations completely disconnected from the text. An extraction tool must infer the entire structure from spatial relationships.
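
To make this concrete, here is a minimal Python sketch that walks the sample content stream above and prints what an extraction tool actually receives. It is a toy tokenizer, not a PDF parser: the function name text_fragments is hypothetical, and it assumes only the BT, ET, Tf, Td, and Tj operators appear and that strings contain no escaped parentheses.

```python
import re

# Toy tokenizer: matches either a (string) literal or a bare word/number.
# Assumes no escaped parentheses inside strings.
TOKEN = re.compile(r"\((?P<string>[^)]*)\)|(?P<word>[^\s()]+)")
OPERATORS = {"BT", "ET", "Tf", "Td", "Tj"}


def text_fragments(stream: str) -> list[tuple[float, float, str]]:
    """Return (x, y, text) for every string shown with the Tj operator."""
    x = y = 0.0
    operands: list[str] = []
    fragments = []
    for match in TOKEN.finditer(stream):
        if match.group("string") is not None:
            operands.append(match.group("string"))
            continue
        word = match.group("word")
        if word not in OPERATORS:
            operands.append(word)  # a number or a font name such as /F1
            continue
        if word == "BT":
            x = y = 0.0  # a new text object starts at the origin
        elif word == "Td":
            # Td moves by an offset relative to the start of the current line.
            x += float(operands[-2])
            y += float(operands[-1])
        elif word == "Tj":
            fragments.append((x, y, operands[-1]))
        operands.clear()
    return fragments


sample = ("BT /F1 10 Tf 72 650 Td (01/15/2026) Tj 200 0 Td "
          "(Office Supplies) Tj 180 0 Td (125.00) Tj ET")
for fragment in text_fragments(sample):
    print(fragment)
```

The output is three positioned strings that share a y value of 650. Nothing in the stream says they form a row, and nothing says which column each one belongs to.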

Three Types of Table Borders

Bordered (Lattice) tables have visible lines around every cell. These are the easiest to extract because the lines explicitly define cell boundaries. Common in formal financial statements, government forms, and standardized reports.

Borderless (Stream) tables have no lines at all. Structure is defined entirely by whitespace alignment - text items sharing consistent x-coordinates across rows form implied columns. Common in research papers, invoices, and product catalogs.

Semi-bordered tables have only partial borders - typically horizontal rules between sections but no vertical dividers. Extremely common in bank statements, brokerage reports, and utility bills. These are the hardest to extract because partial borders mislead lattice-mode parsers while missing borders reduce stream-mode confidence.

Tagged vs. Untagged PDFs

Tagged PDFs include structural metadata that identifies headings, paragraphs, and table cells. Untagged PDFs have none of this - the extraction tool gets only raw coordinates. The vast majority of PDFs are untagged, including virtually all bank statements, invoices, and financial reports.
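
As a rough illustration, a tagged PDF can usually be recognized from two entries in its document catalog: a /MarkInfo dictionary with /Marked set to true, and a /StructTreeRoot. The sketch below assumes the catalog has already been parsed into a dictionary-like object keyed by PDF names; how you obtain it depends on your PDF library, and the function name is_tagged is hypothetical.

```python
from collections.abc import Mapping


def is_tagged(catalog: Mapping) -> bool:
    """Heuristic check: /MarkInfo says the file is marked and a structure tree exists."""
    mark_info = catalog.get("/MarkInfo") or {}
    return bool(mark_info.get("/Marked")) and "/StructTreeRoot" in catalog


# Hypothetical catalogs, written out by hand for illustration.
tagged = {"/Type": "/Catalog", "/MarkInfo": {"/Marked": True}, "/StructTreeRoot": {}}
untagged = {"/Type": "/Catalog"}
print(is_tagged(tagged), is_tagged(untagged))
```

A positive result only means tags are present. It does not guarantee that the tags describe the tables correctly.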

Method 1: PDFSub Extract Tables (Free + AI Fallback)

PDFSub's Extract Tables tool uses a three-tier approach that tries the cheapest method first and escalates only when it fails:

Tier 1: Coordinate-Based Detection (Browser, Free)

The tool first attempts extraction entirely in your browser:

Parses the PDF content stream to extract every text item with its x,y coordinates
Groups text items into lines based on y-coordinate proximity
Analyzes x-coordinate alignment patterns across lines to detect column boundaries
Accepts a table only if it has at least 3 rows, at least 2 columns, and a confidence score of 70% or higher
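
The steps above can be sketched in a few lines of Python. This is an illustration of the general technique, not PDFSub's actual implementation: the function names, the tolerance values, and the definition of confidence (the share of grid cells that end up filled) are all assumptions made for the example.

```python
def cluster(values, tolerance):
    """Group sorted values; a gap larger than tolerance starts a new cluster."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def detect_table(items, y_tol=3.0, x_tol=5.0,
                 min_rows=3, min_cols=2, min_confidence=0.7):
    """items is a list of (x, y, text). Returns (grid, confidence) or None."""
    if not items:
        return None
    y_clusters = cluster({y for _, y, _ in items}, y_tol)
    x_clusters = cluster({x for x, _, _ in items}, x_tol)
    # PDF y coordinates grow upward, so the highest y is the first row.
    row_of = {y: i for i, ys in enumerate(reversed(y_clusters)) for y in ys}
    col_of = {x: j for j, xs in enumerate(x_clusters) for x in xs}

    n_rows, n_cols = len(y_clusters), len(x_clusters)
    grid = [[""] * n_cols for _ in range(n_rows)]
    for x, y, text in items:
        # If two items land in the same cell, the later one wins in this sketch.
        grid[row_of[y]][col_of[x]] = text

    # Assumed confidence measure: fraction of grid cells that hold text.
    filled = sum(1 for row in grid for cell in row if cell)
    confidence = filled / (n_rows * n_cols)
    if n_rows < min_rows or n_cols < min_cols or confidence < min_confidence:
        return None
    return grid, confidence


items = [
    (72, 650, "Date"), (272, 650, "Description"), (452, 650, "Amount"),
    (72, 636, "01/15/2026"), (273, 636.5, "Office Supplies"), (452, 636, "125.00"),
    (72, 622, "01/16/2026"), (272, 622, "Coffee"), (451, 622, "4.50"),
]
print(detect_table(items))
```

Stray text such as a page title lands in its own row and column, which lowers the fill ratio. That is one simple way a detector can reject regions that are not really tables.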

If good tables are found, you get structured data immediately - no server upload, no AI credits consumed, and your file never leaves your device.

Tier 2: Server-Side Extraction (pdfplumber, Free)

If coordinate-based detection finds no tables, the tool uses pdfplumber (MIT license) on the server. This detects both explicit lines (drawn borders) and implied lines (word alignment patterns), finds intersections, identifies rectangles, and maps text to cells.

Tier 3: AI Extraction (Uses Credits)

For scanned PDFs, complex layouts, or tables that rule-based methods can't parse, the tool falls back to AI-powered vision extraction. You can also toggle "Force AI extraction" to skip directly to this tier when you know the table is complex.

Output formats: Excel (.xlsx), CSV, JSON.

Best for: Quick extraction without installing software. Digital PDFs that Tier 1 can parse are processed entirely in your browser, so the file stays on your device.

Method 2: Power Query in Excel (Windows Only)

Available in Excel 2019+ and Microsoft 365 on Windows: Data → Get Data → From File → From PDF.

How It Works

Click Data → Get Data → From File → From PDF
Select your PDF file
Power Query displays a Navigator panel listing detected tables per page
Select the tables you want, click Transform Data to clean up, then Load

Strengths

Built into Excel - no additional cost for Microsoft 365 subscribers
Power Query's transformation engine handles post-processing well (fill down, pivot, merge columns)
Can refresh data if the source PDF is updated
Supports connecting multiple tables from the same PDF

Limitations

Windows only - not available in Excel for Mac, Excel Online, or mobile
Struggles with borderless tables - works best with clearly bordered tables
No OCR - cannot extract from scanned/image PDFs
Multi-page tables are problematic - each page often imports as a separate table, requiring manual stitching
Multi-line rows - wrapped text within cells often splits into multiple rows, requiring cleanup

Best for: Windows users with Microsoft 365 who have simple, bordered tables.

Method 3: Adobe Acrobat (Paid)

File → Export a PDF → Spreadsheet → Microsoft Excel Workbook

Pricing (2026)

Acrobat Standard: $12.99/month (annual plan)
Acrobat Pro: $19.99/month (annual plan)
Export PDF (standalone): lower-tier conversion-only plan

Strengths

Built-in OCR for scanned documents
Generally preserves formatting for simple bordered tables
Batch processing available in Pro

Limitations

Expensive for table extraction alone - $156–$240/year
Complex tables with merged cells and multi-page spans still produce misaligned output
Files may be uploaded to Adobe's cloud for processing - problematic for sensitive financial data
Requires desktop installation

Best for: Users who already pay for Acrobat Pro and need occasional table exports with OCR.

Method 4: Copy-Paste (Manual)

The most intuitive approach - and the one that fails most often for tables.

Common Problems

All data in one column - the entire table pastes with no column breaks
Numbers become text - currency symbols, parentheses, and separators break numeric formatting
Multi-line cell content creates phantom rows - a description that wraps across two lines in the cell becomes two separate rows
Headers separated from data - the header row gets disconnected
Columns misaligned - data shifts because character spacing doesn't translate to tabs

Partial Workaround

Paste into Excel, then use Data → Text to Columns with space or fixed-width delimiters. Enable "Treat consecutive delimiters as one." This works for very simple, well-spaced tables but fails for anything with multi-word cell content.
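
The failure mode is easy to reproduce outside Excel. The Python sketch below splits a pasted line in two ways: on every space, and on runs of two or more spaces. The second approach only helps if the pasted text keeps wider gaps between columns than between words, which is an assumption that many PDFs do not satisfy. The function name split_columns is hypothetical.

```python
import re


def split_columns(line: str) -> list[str]:
    """Split on runs of two or more spaces, keeping single spaces inside a cell."""
    return re.split(r" {2,}", line.strip())


pasted = "01/15/2026   Office Supplies   125.00"

# Splitting on any whitespace breaks the two-word description into two cells.
print(pasted.split())
# Splitting on wide gaps keeps "Office Supplies" together.
print(split_columns(pasted))
```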

Best for: Extracting a single small, simple table as a last resort.

Method 5: Python Libraries (For Developers)

Three MIT-licensed libraries handle PDF table extraction programmatically:

Tabula-py

Python wrapper around Tabula (Java). Requires Java runtime.

Lattice mode for bordered tables (finds lines and intersections)
Stream mode for borderless tables (uses text alignment)
Good for batch processing in scripts
No OCR support

Camelot

Also offers lattice and stream modes.

Generally outperforms Tabula for bordered tables
Stream mode has more configuration parameters for fine-tuning
Provides accuracy reports with each extraction
Requires Ghostscript as an additional dependency
No OCR support
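
A minimal sketch of how Camelot's accuracy reports can be used to filter results. It relies on camelot.read_pdf and the parsing_report dictionary that each extracted table exposes. The 90% threshold and the function names are assumptions for illustration, and the right cutoff depends on your documents.

```python
def accurate_tables(reports, min_accuracy=90.0):
    """Return the indexes of reports whose accuracy meets the threshold."""
    return [index for index, report in enumerate(reports)
            if report.get("accuracy", 0.0) >= min_accuracy]


def extract_with_camelot(pdf_path, pages="1", flavor="lattice"):
    """Extract tables and keep only those Camelot itself rates as accurate."""
    # Third-party dependency, imported here so the helper above works without it.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    keep = accurate_tables([table.parsing_report for table in tables])
    return [tables[index].df for index in keep]  # pandas DataFrames


# Hand-written example reports to show the filtering step on its own.
reports = [{"accuracy": 99.1, "page": 1}, {"accuracy": 62.4, "page": 2}]
print(accurate_tables(reports))
```

Switching flavor to "stream" is the usual next step when lattice mode returns nothing for a borderless table.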

pdfplumber

Coordinate-based approach: extracts every character with its exact position, then infers structure.

Handles the widest range of table types
Gives the most control but requires more configuration
This is the library PDFSub uses server-side
No OCR support
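
A short pdfplumber sketch, for readers who want a starting point. It uses pdfplumber.open and page.extract_tables with the table_settings strategies "lines" (for drawn borders) and "text" (for alignment-based detection). The constant and function names are hypothetical, and real documents usually need more tuning than this.

```python
# Strategy presets: "lines" follows drawn borders, "text" infers edges from word alignment.
LINES_SETTINGS = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
TEXT_SETTINGS = {"vertical_strategy": "text", "horizontal_strategy": "text"}


def clean_table(table):
    """Replace None cells with empty strings and drop rows that are entirely empty."""
    rows = [[(cell or "").strip() for cell in row] for row in table]
    return [row for row in rows if any(row)]


def extract_tables(pdf_path, settings=LINES_SETTINGS):
    """Return a list of (page_number, rows) for every table pdfplumber finds."""
    # Third-party dependency, imported here so clean_table works without it.
    import pdfplumber

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables(table_settings=settings):
                results.append((page_number, clean_table(table)))
    return results


# pdfplumber returns None for empty cells; clean_table normalizes them.
print(clean_table([["Date", "Amount"], [None, None], ["01/15/2026", None]]))
```

A reasonable pattern is to try LINES_SETTINGS first and fall back to TEXT_SETTINGS when a page yields no tables.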

Best for: Developers automating recurring table extraction workflows, processing large batches of similar documents.

Common Problems and How to Solve Them

Merged Cells

When cells span multiple rows or columns, most tools either place the content in the top-left cell and leave the others empty, or misalign all subsequent columns. There's no universal solution, and CSV output always loses merge information because the format has no concept of merged cells.

Fix: Extract the table, then manually fix merge artifacts in Excel. For recurring tables with the same merge pattern, consider a post-processing script.
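
For the common case of a vertically merged label (one cell spanning several rows), a post-processing script can be as small as a fill-down. The sketch below assumes the tool put the value in the first row of the span and left empty strings beneath it. The function name fill_down is hypothetical.

```python
def fill_down(rows, columns):
    """Copy the last non-empty value downward in the given column indexes."""
    last_seen = {}
    filled = []
    for row in rows:
        row = list(row)  # copy so the input is not modified
        for column in columns:
            if row[column] == "":
                row[column] = last_seen.get(column, "")
            else:
                last_seen[column] = row[column]
        filled.append(row)
    return filled


rows = [
    ["Q1", "Jan", "100"],
    ["",   "Feb", "120"],
    ["Q2", "Apr", "90"],
    ["",   "May", "95"],
]
for row in fill_down(rows, columns=[0]):
    print(row)
```

This is the same operation as Power Query's fill down. Apply it only to columns where an empty cell really means "same as above".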

Multi-Line Content Within Cells

Long descriptions that wrap within a cell become multiple rows in the output, pushing all subsequent data out of alignment. This is the single most common extraction error for financial documents.

Fix: After extraction, look for rows that are missing dates and amounts - these are likely continuation lines that belong to the row above. In Excel, merge them manually or use a helper formula.
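
If you would rather script the cleanup, the rule described above translates directly into code. This sketch assumes a three-column layout of date, description, and amount, and treats any row with an empty date and an empty amount as a continuation of the previous row. The function name and column positions are assumptions.

```python
def merge_continuations(rows, key_columns=(0, 2), text_column=1):
    """Fold rows with empty key columns into the text of the row above."""
    merged = []
    for row in rows:
        is_continuation = all(row[column] == "" for column in key_columns)
        if is_continuation and merged:
            previous = merged[-1]
            previous[text_column] = f"{previous[text_column]} {row[text_column]}".strip()
        else:
            merged.append(list(row))
    return merged


rows = [
    ["01/15/2026", "Office Supplies", "125.00"],
    ["", "and printer toner", ""],
    ["01/16/2026", "Coffee", "4.50"],
]
for row in merge_continuations(rows):
    print(row)
```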

Tables Spanning Multiple Pages

Tools must determine where the table continues, whether to strip repeated headers, and how to filter page footers. Many tools treat each page independently.

Fix: If your tool gives per-page results, combine the sheets and remove repeated header rows. Check that the last row on page N connects correctly to the first row on page N+1.
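
A minimal sketch of the stitching step in Python, assuming each page arrives as a list of rows and that the first row of the first page is the header. The function name combine_pages is hypothetical. Note that it drops any row identical to the header, which is usually what you want but worth knowing.

```python
def combine_pages(pages):
    """Concatenate per-page row lists, keeping the header from the first page only."""
    pages = [page for page in pages if page]  # skip pages with no rows
    if not pages:
        return []
    header = pages[0][0]
    combined = [header]
    for page in pages:
        combined.extend(row for row in page if row != header)
    return combined


page_1 = [["Date", "Amount"], ["01/15/2026", "125.00"]]
page_2 = [["Date", "Amount"], ["01/16/2026", "4.50"]]
for row in combine_pages([page_1, page_2]):
    print(row)
```

The script cannot tell whether a row was split across the page break, so the manual check of the last and first rows still applies.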

Currency Formatting Issues

Negative numbers in parentheses ((1,234.56)) paste as text, not numbers. Currency symbols and thousand separators also break numeric formatting.

Fix: After extraction, select the amount column and use Find & Replace to remove currency symbols such as $ and any thousand separators that Excel does not recognize. For parenthesized negatives, replace ( with - and remove ). Then format the column as Number.
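
The same cleanup can be scripted. The sketch below assumes US-style amounts (comma as the thousands separator, period as the decimal point, dollar sign as the currency symbol). Other locales need different rules. The function name parse_amount is hypothetical.

```python
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert strings like '$1,234.56' or '(1,234.56)' to a signed Decimal."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    # Accounting notation: parentheses mean the value is negative.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = Decimal(cleaned)
    return -value if negative else value


for raw in ["$125.00", "(1,234.56)", "$(45.10)", "-7.25"]:
    print(raw, "->", parse_amount(raw))
```

Decimal is used instead of float so that currency values keep their exact cents.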

Date Ambiguity

01/02/2026 - is that January 2 or February 1? The extraction tool preserves the string as-is, but Excel may reinterpret it based on your locale.

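Fix: Check the source PDF for date format clues (look for dates where one of the first two numbers is greater than 12, since that number can only be a day). Then tell Excel which order to use when importing, for example with the date option in Text to Columns or Power Query's Change Type → Using Locale, rather than relying on your system default.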
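
The "look for a value above 12" check is simple to automate across a whole column. This sketch assumes slash-separated dates with a four-digit year in the last position. The function names are hypothetical, and a result of None means the sample gives no evidence either way.

```python
from datetime import date


def infer_day_first(date_strings):
    """Return True for DD/MM, False for MM/DD, None if every date is ambiguous."""
    for text in date_strings:
        first, second, _year = (int(part) for part in text.split("/"))
        if first > 12:
            return True   # first number cannot be a month
        if second > 12:
            return False  # second number cannot be a month
    return None


def parse_date(text, day_first):
    """Parse a slash-separated date using the order that was inferred."""
    a, b, year = (int(part) for part in text.split("/"))
    day, month = (a, b) if day_first else (b, a)
    return date(year, month, day)


column = ["01/02/2026", "01/15/2026"]
order = infer_day_first(column)
print(order, [parse_date(text, order) for text in column])
```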

Accuracy Comparison

Method	Simple Bordered	Borderless	Semi-bordered	Scanned PDFs
PDFSub (coordinate + AI)	90–99%	75–95%	70–95%	85–95% (AI)
Power Query	85–95%	40–60%	50–70%	Not supported
Adobe Acrobat	90–95%	70–80%	70–85%	80–90%
Tabula	~68%	55–70%	50–65%	Not supported
Camelot	~73%	65–75%	60–70%	Not supported
Copy-paste	30–50%	10–30%	10–30%	Not possible

Ranges reflect variation across document complexity. Benchmark data from Procycons 2025 PDF Extraction Benchmark and Camelot comparison studies.

Which Method Should You Use?

Scenario	Best Method	Why
Quick one-off extraction	PDFSub	No install, browser-based, free coordinate extraction
Simple bordered table, Windows	Power Query	Built into Excel, no additional cost
Scanned PDF	PDFSub (AI) or Adobe Acrobat	Need OCR capability
Sensitive financial data	PDFSub	Browser-based processing; file is not uploaded when Tier 1 extraction succeeds
Recurring batch processing	Python (pdfplumber)	Scriptable, automatable
Already have Acrobat Pro	Adobe Acrobat	Already paying, simple tables work well
Single small table, no tools	Copy-paste	Last resort, verify everything

Tips for Best Results

Use native PDFs. Download documents from their source rather than scanning paper. Native PDFs carry an exact text layer, so extraction avoids OCR errors entirely and is far more accurate.

Identify the table type first. Bordered tables work with almost any tool. Borderless tables need stream-mode or AI extraction. Knowing the type helps you choose the right method upfront.

Start with free, rule-based methods. Try coordinate-based extraction first. Only escalate to AI when rule-based methods produce poor results - this saves time and credits.

Always verify the output. Check row count, column alignment, numeric values, and totals. Never trust extraction output blindly.
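
These checks are worth scripting if you extract the same kind of document regularly. The sketch below assumes you know the expected row count and total from the source document (a statement summary, for example) and that amounts have already been cleaned into plain numeric strings. The function name verify_table is hypothetical.

```python
from decimal import Decimal, InvalidOperation


def verify_table(rows, expected_rows, amount_column, expected_total):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(rows)}")
    if len({len(row) for row in rows}) > 1:
        problems.append("rows have inconsistent column counts")

    total = Decimal(0)
    for index, row in enumerate(rows):
        try:
            total += Decimal(row[amount_column])
        except (InvalidOperation, IndexError):
            problems.append(f"row {index}: amount is missing or not numeric")
    if total != expected_total:
        problems.append(f"total {total} does not match expected {expected_total}")
    return problems


rows = [["Office Supplies", "125.00"], ["Coffee", "4.50"]]
print(verify_table(rows, expected_rows=2, amount_column=1,
                   expected_total=Decimal("129.50")))
```

A matching total is strong evidence that no row was dropped or duplicated, which is why statement totals are such a useful cross-check.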

Watch for number formatting. After extraction, verify numbers are actually numbers in Excel (right-aligned), not text strings (left-aligned). Currency symbols and parenthesized negatives are common culprits.

For sensitive data, prefer browser-based tools. Financial reports, bank statements, and tax documents contain sensitive information. Tools that process PDFs entirely in your browser do not upload your file, which removes the risk of exposure in transit or on a third-party server.

Try It Free

Ready to extract tables from your PDF? Upload a file now - PDFSub tries free coordinate-based extraction first, with AI fallback for complex tables. Digital PDFs are processed entirely in your browser. Start a 7-day free trial.
