PDF Processing

Extract, merge, and manipulate PDF files with Python automation

Verified
Tested and verified by our team
16036 Stars

Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.

pdf-processing text-extraction document-automation table-extraction form-filling file-merging data-extraction python

See It In Action

User Prompt

I need to extract all tables from this quarterly report PDF and convert them to CSV format

Agent Response

Python code that uses pdfplumber to extract tables and save them as structured CSV files

Quick Start (3 Steps)

Get up and running in minutes

1. Install

claude-code skill install pdf-processing

2. Config

3. First Trigger

@pdf-processing help

Commands

| Command | Description | Required Args |
| --- | --- | --- |
| @pdf-processing extract-data-from-reports | Extract text and tables from PDF reports for analysis | None |
| @pdf-processing batch-document-processing | Process multiple PDF files to extract text content | None |
| @pdf-processing pdf-document-management | Merge, split, or reorganize PDF documents | None |

Typical Use Cases

Extract Data from Reports

Extract text and tables from PDF reports for analysis

Batch Document Processing

Process multiple PDF files to extract text content

PDF Document Management

Merge, split, or reorganize PDF documents

Overview

PDF Processing

Quick start

Use pdfplumber to extract text from PDFs:

import pdfplumber

# Open the PDF and print the text of the first page
with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)

Extracting tables

Extract tables from PDFs with automatic detection:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    # Each table is a list of rows; each row is a list of cell values
    tables = page.extract_tables()

    for table in tables:
        for row in table:
            print(row)
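
Printed rows are hard to work with downstream, so a common next step is to key each row by its column header. The following is a minimal sketch, assuming the first row of a table is its header and that empty cells may come back as None; `table_to_records` is a hypothetical helper name, not part of pdfplumber.

```python
def table_to_records(table):
    """Turn one extracted table (a list of rows) into a list of dicts keyed by header."""
    if not table:
        return []
    # Normalise every cell to a stripped string; None becomes ""
    rows = [[(cell or "").strip() for cell in row] for row in table]
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Sample data in the list-of-rows shape used for a single table
sample = [["Quarter", "Revenue"], ["Q1", "100"], ["Q2", None]]
records = table_to_records(sample)
print(records)
```

Each table in the list returned by `page.extract_tables()` could be passed through this helper one at a time.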

Extracting all pages

Extract text from every page of a multi-page document:

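import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    # extract_text() may yield nothing for image-only pages, so fall back to ""
    page_texts = [page.extract_text() or "" for page in pdf.pages]

# Join once at the end instead of concatenating strings inside the loop
full_text = "\n\n".join(page_texts)
print(full_text)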

Form filling

For PDF form filling, see FORMS.md for the complete guide including field analysis and validation.

Merging PDFs

Combine multiple PDF files:

from pypdf import PdfWriter

# PdfWriter.append() replaces PdfMerger, which was deprecated and then removed in pypdf 5.0
writer = PdfWriter()

for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    writer.append(pdf)

writer.write("merged.pdf")
writer.close()

Splitting PDFs

Extract specific pages or ranges:

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Extract pages 2-5; reader.pages is zero-indexed, so use indices 1-4
for page_num in range(1, 5):
    writer.add_page(reader.pages[page_num])

with open("output.pdf", "wb") as output:
    writer.write(output)
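
The off-by-one between human page numbers and zero-based indices is easy to get wrong, so it can help to isolate the conversion. Below is a minimal sketch, assuming page selections are given as a 1-based string such as "2-5,8"; `parse_page_ranges` is a hypothetical helper, not a pypdf function.

```python
def parse_page_ranges(spec, total_pages):
    """Convert a 1-based spec such as "2-5,8" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        # A single number like "8" is treated as the range 8-8
        end = int(end_text) if end_text else start
        if start < 1 or end < start or end > total_pages:
            raise ValueError(
                f"Invalid page range {part!r} for a {total_pages}-page document"
            )
        # Page N lives at index N - 1, so the range 2-5 maps to indices 1..4
        indices.extend(range(start - 1, end))
    return indices


print(parse_page_ranges("2-5,8", total_pages=10))
```

With a reader and writer set up as above, the result could drive the loop: iterate over `parse_page_ranges("2-5", len(reader.pages))` and pass each index to `reader.pages[...]`.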

Available packages

  • pdfplumber - Text and table extraction (recommended)
  • pypdf - PDF manipulation, merging, splitting
  • pdf2image - Convert PDFs to images (requires poppler)
  • pytesseract - OCR for scanned PDFs (requires tesseract)

Common patterns

Extract and save text:

import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    # Treat pages with no extractable text as empty strings
    text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)

Extract tables to CSV:

import csv

import pdfplumber

with pdfplumber.open("tables.pdf") as pdf:
    # Only the first page is scanned for tables
    tables = pdf.pages[0].extract_tables()

    with open("output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        # Rows from every table are written back to back into one file
        for table in tables:
            writer.writerows(table)
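
Writing every table into one file mixes rows with different column layouts. One alternative, sketched here under the assumption that a separate file per table is acceptable, is to number the output files; `write_tables_to_csv` and its `stem` parameter are hypothetical names introduced for illustration.

```python
import csv
from pathlib import Path


def write_tables_to_csv(tables, out_dir, stem="table"):
    """Write each table to its own CSV file and return the paths created."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, table in enumerate(tables, start=1):
        path = out_dir / f"{stem}_{number}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            # csv.writer emits None cells as empty fields
            csv.writer(f).writerows(table)
        paths.append(path)
    return paths
```

The `tables` argument is the list returned by `extract_tables()`, so the function could be called directly with the result from the example above.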

Error handling

Handle common PDF issues:

import pdfplumber

try:
    with pdfplumber.open("document.pdf") as pdf:
        if len(pdf.pages) == 0:
            print("PDF has no pages")
        else:
            text = pdf.pages[0].extract_text()
            # Scanned pages have no text layer, so extraction comes back empty
            if text is None or text.strip() == "":
                print("Page contains no extractable text (might be scanned)")
            else:
                print(text)
except Exception as e:
    print(f"Error processing PDF: {e}")
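
When a page turns out to be scanned, the pdf2image and pytesseract packages listed above can supply the text instead. The sketch below assumes poppler and tesseract are installed on the system; the function names (`needs_ocr`, `ocr_page`, `extract_page_text`) and the 300 dpi default are illustrative choices, not part of any of these libraries.

```python
def needs_ocr(text):
    """Return True when a page produced no usable text layer."""
    return text is None or not text.strip()


def ocr_page(pdf_path, page_number, dpi=300):
    """OCR a single 1-based page; requires poppler and tesseract on the system."""
    # Imported here so the rest of the script still runs without the OCR stack
    from pdf2image import convert_from_path
    import pytesseract

    # Render only the requested page to an image
    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0])


def extract_page_text(pdf_path, page_number):
    """Try the text layer first and fall back to OCR for scanned pages."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        text = pdf.pages[page_number - 1].extract_text()
    return ocr_page(pdf_path, page_number) if needs_ocr(text) else text
```

OCR is far slower than reading a text layer, which is why this sketch only falls back to it when extraction returns nothing.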

Performance tips

  • Process pages in batches for large PDFs
  • Use multiprocessing for multiple files
  • Extract only needed pages rather than entire document
  • Close PDF objects after use
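
The batching and multiprocessing tips can be sketched as follows, using only the standard library for the parallelism. The helper names (`page_batches`, `extract_file_text`, `extract_many`) are hypothetical, and one worker process per file is an assumption rather than a tuned recommendation.

```python
from concurrent.futures import ProcessPoolExecutor


def page_batches(total_pages, batch_size):
    """Yield ranges of zero-based page indices, batch_size pages at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield range(start, min(start + batch_size, total_pages))


def extract_file_text(pdf_path):
    """Worker: extract the text of one PDF, closing it when done."""
    import pdfplumber  # imported inside the worker so each process loads it itself

    with pdfplumber.open(pdf_path) as pdf:
        return "\n\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_many(pdf_paths, max_workers=None):
    """Extract several PDFs in parallel, one file per worker process."""
    pdf_paths = list(pdf_paths)  # materialise so the paths can be iterated twice
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(pdf_paths, pool.map(extract_file_text, pdf_paths)))
```

Within a single large file, `page_batches(len(pdf.pages), 50)` would give index ranges to process and write out one batch at a time. On platforms that spawn worker processes, `extract_many` should be called from under an `if __name__ == "__main__":` guard.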

Environment Matrix

Dependencies

pdfplumber (recommended for text/table extraction)
pypdf (for PDF manipulation)
pdf2image (requires poppler for image conversion)
pytesseract (requires tesseract for OCR)

Framework Support

  • pdfplumber ✓ (recommended)
  • pypdf ✓
  • pdf2image ✓
  • pytesseract ✓

Context Window

Token usage: ~1K-3K tokens for typical PDF processing tasks

Information

Author: davila7
Updated: 2026-01-30
Category: productivity-tools