mirror of
https://github.com/shareAI-lab/analysis_claude_code.git
synced 2026-02-04 13:16:37 +08:00
- Remove all reverse-engineered Claude Code source code - Replace with 100% original educational content from mini-claude-code - Add clear disclaimer: independent project, not affiliated with Anthropic - 5 progressive agent implementations (v0-v4, ~1100 lines total) - Include agent-builder skill for teaching agent construction - Bilingual documentation (EN + ZH) This repository now focuses purely on teaching how modern AI agents work through original, from-scratch implementations. Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
113 lines
2.5 KiB
Markdown
---
name: pdf
description: Process PDF files - extract text, create PDFs, merge documents. Use when user asks to read PDF, create PDF, or work with PDF files.
---

# PDF Processing Skill

You now have expertise in PDF manipulation. Follow these workflows:

## Reading PDFs

**Option 1: Quick text extraction (preferred)**
```bash
# Using pdftotext (poppler-utils)
pdftotext input.pdf -           # Output to stdout
pdftotext input.pdf output.txt  # Output to file

# If pdftotext is not available, try:
python3 -c "
import fitz  # PyMuPDF
doc = fitz.open('input.pdf')
for page in doc:
    print(page.get_text())
"
```
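
The fallback order above can also be wrapped in one Python helper. This is a minimal sketch, not a definitive implementation: the function names `pick_extractor` and `extract_text` are hypothetical, and it assumes that decoding the `pdftotext` output as UTF-8 is acceptable for your files.

```python
import shutil
import subprocess


def pick_extractor(which=shutil.which):
    """Return 'pdftotext' if the CLI is on PATH, otherwise 'pymupdf'.

    `which` is injectable (hypothetical parameter) so the choice can be tested.
    """
    return "pdftotext" if which("pdftotext") else "pymupdf"


def extract_text(path, which=shutil.which):
    """Extract all text from a PDF, preferring pdftotext over PyMuPDF."""
    if pick_extractor(which) == "pdftotext":
        # "-" sends the extracted text to stdout instead of a file.
        result = subprocess.run(
            ["pdftotext", path, "-"],
            capture_output=True,
            encoding="utf-8",
            errors="replace",  # assumption: replace undecodable bytes
            check=True,
        )
        return result.stdout

    # Imported lazily so PyMuPDF is only required for the fallback path.
    import fitz

    with fitz.open(path) as doc:
        return "".join(page.get_text() for page in doc)
```

Calling `extract_text("input.pdf")` returns the text as a single string whichever backend is used.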

**Option 2: Page-by-page with metadata**
```python
import fitz  # pip install pymupdf

doc = fitz.open("input.pdf")
print(f"Pages: {len(doc)}")
print(f"Metadata: {doc.metadata}")

for i, page in enumerate(doc):
    text = page.get_text()
    print(f"--- Page {i+1} ---")
    print(text)
```

## Creating PDFs

**Option 1: From Markdown (recommended)**
```bash
# Using pandoc (PDF output requires a LaTeX engine; pdflatex is the default)
pandoc input.md -o output.pdf

# With custom styling
pandoc input.md -o output.pdf --pdf-engine=xelatex -V geometry:margin=1in
```

**Option 2: Programmatically**
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf", pagesize=letter)
# Coordinates are in points (1/72 inch), measured from the bottom-left corner
c.drawString(100, 750, "Hello, PDF!")
c.save()
```

**Option 3: From HTML**
```bash
# Using wkhtmltopdf
wkhtmltopdf input.html output.pdf

# Or with Python (pdfkit wraps wkhtmltopdf, so the binary must still be installed)
python3 -c "
import pdfkit
pdfkit.from_file('input.html', 'output.pdf')
"
```

## Merging PDFs

```python
import fitz  # PyMuPDF

result = fitz.open()  # New empty PDF that collects the pages
for pdf_path in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    with fitz.open(pdf_path) as doc:  # Close each source once it is copied
        result.insert_pdf(doc)
result.save("merged.pdf")
result.close()
```

## Splitting PDFs

```python
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
for i in range(len(doc)):
    single = fitz.open()  # New empty PDF for this page
    # from_page and to_page are zero-based and inclusive
    single.insert_pdf(doc, from_page=i, to_page=i)
    single.save(f"page_{i+1}.pdf")
    single.close()
doc.close()
```
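
The same `insert_pdf` call can split a document into multi-page chunks instead of single pages. The sketch below is one way to do it under stated assumptions: `chunk_ranges` and `split_into_chunks` are hypothetical helper names, and the `part_N.pdf` output naming is only illustrative.

```python
def chunk_ranges(page_count, chunk_size):
    """Return zero-based, inclusive (first, last) page ranges per chunk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


def split_into_chunks(path, chunk_size, prefix="part"):
    """Write consecutive chunks of `chunk_size` pages to prefix_N.pdf files."""
    import fitz  # PyMuPDF; imported lazily so chunk_ranges works without it

    with fitz.open(path) as doc:
        ranges = chunk_ranges(len(doc), chunk_size)
        for n, (first, last) in enumerate(ranges, start=1):
            with fitz.open() as part:  # new empty PDF for this chunk
                part.insert_pdf(doc, from_page=first, to_page=last)
                part.save(f"{prefix}_{n}.pdf")
```

Keeping the range arithmetic in its own function makes the inclusive page bounds easy to check before any file is written.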

## Key Libraries

| Task | Library | Install |
|------|---------|---------|
| Read/Write/Merge | PyMuPDF | `pip install pymupdf` |
| Create from scratch | ReportLab | `pip install reportlab` |
| HTML to PDF | pdfkit | `pip install pdfkit` + wkhtmltopdf |
| Text extraction | pdftotext | `brew install poppler` / `apt install poppler-utils` |

## Best Practices

1. **Check that tools are installed** before using them
2. **Handle encoding issues**: PDFs may contain various character encodings
3. **Large PDFs**: Process page by page to avoid memory issues
4. **OCR for scanned PDFs**: Use `pytesseract` if text extraction returns empty text
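
Practices 1 and 4 can be sketched in a few lines of Python. The helper names `missing_tools` and `pages_needing_ocr` are hypothetical, the default tool lists simply mirror the commands and packages used in this skill, and the `min_chars` threshold is an assumption to tune per document.

```python
import importlib.util
import shutil


def missing_tools(
    commands=("pdftotext", "pandoc", "wkhtmltopdf"),
    modules=("fitz", "reportlab", "pdfkit"),
    which=shutil.which,
    find_spec=importlib.util.find_spec,
):
    """Return the names of CLI tools and Python modules that are not available."""
    missing = [cmd for cmd in commands if which(cmd) is None]
    missing += [mod for mod in modules if find_spec(mod) is None]
    return missing


def pages_needing_ocr(page_texts, min_chars=1):
    """Return 1-based page numbers whose extracted text is effectively empty.

    `page_texts` is any iterable of strings, one per page, so a generator
    over page.get_text() keeps only one page of text in memory at a time.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]
```

With PyMuPDF, `pages_needing_ocr(page.get_text() for page in doc)` lists the pages that are likely scanned images and are candidates for `pytesseract`.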