CrazyBoyM 85f44c358a Complete rewrite: original educational content only
2025-12-31 07:01:42 +08:00


name: pdf
description: Process PDF files - extract text, create PDFs, merge documents. Use when the user asks to read a PDF, create a PDF, or work with PDF files.

PDF Processing Skill

You now have expertise in PDF manipulation. Follow these workflows:

Reading PDFs

Option 1: Quick text extraction (preferred)

# Using pdftotext (poppler-utils)
pdftotext input.pdf -  # Output to stdout
pdftotext input.pdf output.txt  # Output to file

# If pdftotext is not available, fall back to PyMuPDF:
python3 -c "
import fitz  # PyMuPDF
doc = fitz.open('input.pdf')
for page in doc:
    print(page.get_text())
"

Option 2: Page-by-page with metadata

import fitz  # pip install pymupdf

doc = fitz.open("input.pdf")
print(f"Pages: {len(doc)}")
print(f"Metadata: {doc.metadata}")

for i, page in enumerate(doc):
    text = page.get_text()
    print(f"--- Page {i+1} ---")
    print(text)
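
The two reading options above can be combined into one helper that prefers the `pdftotext` CLI and falls back to PyMuPDF when the CLI is missing. This is a minimal sketch, not part of the original skill: the function names `pdftotext_command` and `extract_text` are hypothetical, and decoding the CLI output as UTF-8 is an assumption based on pdftotext's default output encoding.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, output_path=None):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    return ["pdftotext", pdf_path, output_path or "-"]


def extract_text(pdf_path):
    """Return the text of a PDF, preferring pdftotext over PyMuPDF."""
    if shutil.which("pdftotext"):
        result = subprocess.run(
            pdftotext_command(pdf_path),
            capture_output=True,
            encoding="utf-8",   # assumption: pdftotext emits UTF-8 by default
            errors="replace",   # do not crash on undecodable bytes
            check=True,         # raise if pdftotext exits with an error
        )
        return result.stdout

    # Imported lazily so the CLI path works without PyMuPDF installed.
    import fitz  # pip install pymupdf

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)
```

Calling `extract_text("input.pdf")` returns a single string either way, so the caller does not need to know which backend ran.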

Creating PDFs

Option 1: From Markdown (recommended)

# Using pandoc (by default, PDF output requires a LaTeX engine such as pdflatex)
pandoc input.md -o output.pdf

# With the XeLaTeX engine and 1-inch margins
pandoc input.md -o output.pdf --pdf-engine=xelatex -V geometry:margin=1in

Option 2: Programmatically

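from reportlab.lib.pagesizes import letter  # pip install reportlab
from reportlab.pdfgen import canvas

# Coordinates are in points (1/72 inch), measured from the bottom-left corner
c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello, PDF!")  # Near the top of a 612 x 792 pt letter page
c.save()  # Write the file to disk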

Option 3: From HTML

# Using wkhtmltopdf
wkhtmltopdf input.html output.pdf

# Or with Python
python3 -c "
import pdfkit
pdfkit.from_file('input.html', 'output.pdf')
"

Merging PDFs

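import fitz  # pip install pymupdf

result = fitz.open()  # New, empty PDF
for pdf_path in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    with fitz.open(pdf_path) as doc:  # Close each source after copying its pages
        result.insert_pdf(doc)
result.save("merged.pdf")
result.close()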

Splitting PDFs

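import fitz  # pip install pymupdf

doc = fitz.open("input.pdf")
for i in range(len(doc)):
    single = fitz.open()  # New, empty PDF for this page
    single.insert_pdf(doc, from_page=i, to_page=i)  # Page indices are 0-based
    single.save(f"page_{i+1}.pdf")
    single.close()
doc.close()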
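
The one-page-per-file split above generalizes to fixed-size chunks. The sketch below is an illustration rather than part of the original skill: `chunk_ranges` and `split_into_chunks` are hypothetical names, and the `chunk_N.pdf` output naming is an assumption. The range arithmetic lives in its own function so it can be checked without any PDF library.

```python
def chunk_ranges(page_count, chunk_size):
    """Return inclusive, 0-based (first, last) page ranges covering all pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


def split_into_chunks(pdf_path, chunk_size, prefix="chunk"):
    """Write chunk_1.pdf, chunk_2.pdf, ... with at most chunk_size pages each."""
    import fitz  # pip install pymupdf; imported lazily so chunk_ranges stands alone

    with fitz.open(pdf_path) as doc:
        ranges = chunk_ranges(len(doc), chunk_size)
        for n, (first, last) in enumerate(ranges, start=1):
            part = fitz.open()  # new, empty PDF
            part.insert_pdf(doc, from_page=first, to_page=last)
            part.save(f"{prefix}_{n}.pdf")
            part.close()
```

For a 7-page document and `chunk_size=3`, `chunk_ranges` yields `[(0, 2), (3, 5), (6, 6)]`, so the last file holds the single leftover page.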

Key Libraries

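| Task                | Library   | Install                                            |
|---------------------|-----------|----------------------------------------------------|
| Read/Write/Merge    | PyMuPDF   | pip install pymupdf                                |
| Create from scratch | ReportLab | pip install reportlab                              |
| HTML to PDF         | pdfkit    | pip install pdfkit (plus the wkhtmltopdf binary)   |
| Text extraction     | pdftotext | brew install poppler / apt install poppler-utils   |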

Best Practices

  1. Check that each tool is installed before using it.
  2. Handle encoding issues: PDFs may contain text in various character encodings.
  3. Large PDFs: process page by page to avoid memory issues.
  4. OCR for scanned PDFs: use pytesseract (which requires the Tesseract binary) if text extraction returns empty text.
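
The first practice, checking for tools up front, can be sketched with the standard library alone. The helper name `missing_tools` and the two default lists are assumptions for illustration; the lists simply mirror the command-line tools and Python modules mentioned in this skill.

```python
import importlib.util
import shutil

# Command-line tools and importable module names used in this skill.
CLI_TOOLS = ["pdftotext", "pandoc", "wkhtmltopdf"]
PY_MODULES = ["fitz", "reportlab", "pdfkit", "pytesseract"]


def missing_tools(cli_tools=CLI_TOOLS, py_modules=PY_MODULES):
    """Return (missing CLI tools, missing Python modules) as two lists."""
    # shutil.which returns None when the executable is not on PATH.
    missing_cli = [tool for tool in cli_tools if shutil.which(tool) is None]
    # find_spec returns None when a top-level module cannot be imported.
    missing_py = [
        name for name in py_modules
        if importlib.util.find_spec(name) is None
    ]
    return missing_cli, missing_py


if __name__ == "__main__":
    cli, modules = missing_tools()
    print("Missing CLI tools:", ", ".join(cli) or "none")
    print("Missing Python modules:", ", ".join(modules) or "none")
```

Running the check before choosing a workflow lets the agent pick an option whose dependencies are actually present instead of failing midway.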