attachments

Turn anything into LLM-ready context.

A scanned PDF, a spreadsheet, a folder of images, a two-hour recording — you have the file and the question. attachments turns the file into something a model can actually read, so you can get to the question.

$ pip install attachments

In a notebook — Jupyter, Colab, RStudio, Positron, Quarto:

>>> from attachments import att
>>> att("report.pdf[pages: 1-4, images: true]")
<Artifacts: 1 artifact | 202 chars | ~51 tokens | 4 images>
>>> a = att("meeting.mp3") + att("data.xlsx") + att("github://you/repo")
>>> a.claude("What should we do next quarter?")  # ready-to-send messages

You never tell it what kind of file it is, and nothing here raises an exception. When something needs attention, the answer comes back as data that tells you the fix:

>>> att("scan.pdf")  # a scanned PDF, no OCR installed
<Artifacts: 1 artifact | 0 chars | ~0 tokens | 1 image>
* scan.pdf: No text layer (scanned?) - pip install attachments[ocr], or the free hosted tier: attachments.dev
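The "answer comes back as data" idea can be illustrated with a small, self-contained sketch. It does not use the attachments API: `Note`, `Result`, and `read_text` are hypothetical names invented for this example, and the real library's result object may expose diagnostics differently.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Note:
    """Hypothetical diagnostic that travels with the result instead of being raised."""
    source: str
    problem: str
    fix: str


@dataclass
class Result:
    """Hypothetical stand-in for a result object; not the library's Artifact class."""
    text: str = ""
    notes: list = field(default_factory=list)


def read_text(path: str) -> Result:
    """Read a UTF-8 text file, reporting known failures as notes rather than exceptions."""
    try:
        return Result(text=Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return Result(notes=[Note(path, "File not found", "check the path")])
    except UnicodeDecodeError:
        return Result(notes=[Note(path, "Not valid UTF-8", "convert the file or pass its encoding")])


# The caller always gets a Result back and inspects its notes.
result = read_text("does-not-exist.txt")
for note in result.notes:
    print(f"* {note.source}: {note.problem} - {note.fix}")
```

The benefit of this style in a notebook is that a batch over many files keeps going, and each problem arrives next to the file that caused it.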

No Python? Try it in the playground — drop a file, see exactly what a model would see.

What it eats

Scans, spreadsheets, slide decks, Word documents, recordings, photos, notebooks, whole folders, zip files, web pages, GitHub repos. If it has a wrong or missing extension, the bytes are sniffed and routed anyway.

.pdf + OCR
.docx
.pptx
.xlsx
.csv/.tsv
.html
.ipynb
.md .py .json… 20+ text
.png .jpg .heic… images
.svg
.mp3 .wav .m4a… transcription
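Sniffing bytes when the extension is wrong or missing usually means checking well-known leading signatures ("magic numbers"). The sketch below shows the general technique for a handful of formats; the `SIGNATURES` table and `sniff` function are illustrative names, not the library's actual routing code.

```python
# Leading-byte signatures for a few common formats (well-known magic numbers).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),  # also the container used by .docx, .pptx, .xlsx
    (b"ID3", "mp3"),         # MP3 with an ID3v2 tag; untagged MP3s start differently
]


def sniff(data: bytes) -> str:
    """Guess a format from the first bytes, ignoring any file extension."""
    for magic, kind in SIGNATURES:
        if data.startswith(magic):
            return kind
    # WAV is a RIFF container: "RIFF", four size bytes, then "WAVE".
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    return "unknown"


print(sniff(b"%PDF-1.7\n..."))   # a PDF saved as "report.txt" is still a PDF
print(sniff(b"just some text"))
```

Because Office documents share the zip signature, a real router would need a second step that looks inside the archive to tell .docx, .pptx, and .xlsx apart.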

And every source works with every format:

files
directories
globs **/*.py
zip / tar
https://
github://owner/repo

Need just part of a file? Say it in plain options — att("data.xlsx[sheet: Sales, rows: 100]") — the full option reference is generated from the code itself. Typos get corrected: Unknown option 'sheets' for .xlsx — did you mean 'sheet'?
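A "did you mean" message like the one above can be produced with the standard library's `difflib.get_close_matches`. This is a minimal sketch of that technique, not the library's implementation; the `KNOWN_OPTIONS` table is a hypothetical subset built from options that appear on this page, and the message wording is approximate.

```python
import difflib
from typing import Optional

# Hypothetical option table; the real per-format options live in the library.
KNOWN_OPTIONS = {
    ".xlsx": ["sheet", "rows"],
    ".pdf": ["pages", "images", "ocr"],
}


def check_option(ext: str, name: str) -> Optional[str]:
    """Return None for a known option, otherwise a message with the closest match."""
    known = KNOWN_OPTIONS.get(ext, [])
    if name in known:
        return None
    message = f"Unknown option '{name}' for {ext}"
    # get_close_matches ranks candidates by similarity and drops weak matches.
    close = difflib.get_close_matches(name, known, n=1)
    if close:
        message += f" - did you mean '{close[0]}'?"
    return message


print(check_option(".xlsx", "sheets"))
```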

When the install gets painful, we run it for you — free

Most formats install in seconds. Two don't: OCR wants onnxruntime, transcription wants whisper weights — heavy downloads that fail on locked-down lab and university machines. So we keep them warm on our server. Two lines, no account, no key:

>>> from attachments import att, configure
>>> configure(service_url="https://api.attachments.dev/v1")
>>> att("scan.pdf")  # OCR'd remotely; same Artifact back
<Artifacts: 1 artifact | 37 chars | ~10 tokens | 1 image>

Free: files up to 25 MB, 10 requests per minute. Your files are processed in memory and never stored.
When you need more — bigger files, higher volume, GPU OCR, video — a paid tier will remove the limits. Until then, this is the whole deal.
It's the same open-source server you can self-host. The hosted tier is convenience, not capability.
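If you script against the free tier, a client-side guard can keep you inside the stated limits before anything is sent. The sketch below is one way to do that, under assumptions: "25 MB" is read as 25 × 1024 × 1024 bytes, and `within_size_limit` and `Throttle` are hypothetical helpers, not part of the library. How the server itself counts requests is not documented here.

```python
import time
from collections import deque

MAX_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" interpreted as 25 MiB


def within_size_limit(size_bytes: int) -> bool:
    """True if a file of this size fits the free-tier size limit."""
    return size_bytes <= MAX_BYTES


class Throttle:
    """Allow at most `limit` calls per `window` seconds (sliding window)."""

    def __init__(self, limit: int = 10, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait_time(self, now: float) -> float:
        """Seconds to wait before another call is allowed at time `now`."""
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()  # forget calls that left the window
        if len(self.calls) < self.limit:
            return 0.0
        return self.window - (now - self.calls[0])

    def record(self, now: float) -> None:
        """Note that a call was made at time `now`."""
        self.calls.append(now)


throttle = Throttle()
now = time.monotonic()
if within_size_limit(3_000_000) and throttle.wait_time(now) == 0.0:
    throttle.record(now)  # the request would be sent here
```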

You can leave at any time

Everything on this page is open source (MIT) — the library, the parsers, the spec, and the entire hosted service. Self-host it with one command:

$ pip install attachments[server] && attachments-server # or: docker compose up

No telemetry, ever. No accounts. The library makes no network calls unless you point it at a URL — or explicitly opt into the service above.

For developers: built on a one-page contract

If you just want your files read, you're done — everything above is the product. This part is for people building on top of it.

The Artifact

Every input becomes {text, images, audio, video, meta}, with typed errors and page/sheet/slide segments that carry offsets. The shape is validated against a JSON Schema by a conformance suite in CI.
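To make the shape concrete, here is a rough sketch of what such a record could look like as Python dataclasses. Only the five top-level field names come from the text above. The field types, the `Segment` class, and the choice to store segments under `meta["segments"]` are assumptions for illustration; the published JSON Schema is the authority.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical segment record: a page, sheet, or slide located by text offsets."""
    kind: str   # "page", "sheet", or "slide"
    index: int  # 1-based position within the source
    start: int  # offset of the first character in Artifact.text
    end: int    # offset one past the last character


@dataclass
class Artifact:
    """Illustrative container with the five top-level fields named above."""
    text: str = ""
    images: list = field(default_factory=list)
    audio: list = field(default_factory=list)
    video: list = field(default_factory=list)
    meta: dict = field(default_factory=dict)


def segment_text(artifact: Artifact, segment: Segment) -> str:
    """Recover one segment's text from the offsets it carries."""
    return artifact.text[segment.start:segment.end]


pages = [Segment("page", 1, 0, 5), Segment("page", 2, 6, 13)]
doc = Artifact(text="Intro\nDetails", meta={"segments": pages})
print(segment_text(doc, pages[1]))
```

Offsets let a caller quote or cite an exact page without re-parsing the file.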

The DSL

"file.pdf[pages: 1-4, ocr: true]" — a specified grammar with shared test vectors, so every future language port parses it identically. Every option has a kwargs twin.

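As a rough illustration of what parsing the bracket syntax involves, the sketch below splits a spec string into a source and an options dictionary and checks it against a few test vectors. It is not the specified grammar: the `parse` function, its handling of edge cases, and the vectors are assumptions, and values are left as strings rather than typed.

```python
def parse(spec: str) -> tuple:
    """Split 'source[key: value, ...]' into (source, options)."""
    source, options = spec, {}
    if spec.endswith("]") and "[" in spec:
        # Everything after the last "[" is treated as the option list.
        source, _, raw = spec[:-1].rpartition("[")
        for pair in raw.split(","):
            if not pair.strip():
                continue
            # Split on the first colon only, so values may contain colons.
            key, _, value = pair.partition(":")
            options[key.strip()] = value.strip()
    return source, options


# Test vectors in the spirit of the shared ones: the same inputs should
# produce the same outputs in every language port.
VECTORS = [
    ("file.pdf[pages: 1-4, ocr: true]", ("file.pdf", {"pages": "1-4", "ocr": "true"})),
    ("data.xlsx[sheet: Sales, rows: 100]", ("data.xlsx", {"sheet": "Sales", "rows": "100"})),
    ("meeting.mp3", ("meeting.mp3", {})),
]

for spec, expected in VECTORS:
    assert parse(spec) == expected
```

The resulting dictionary is the natural bridge to keyword arguments, which is presumably how each option gets its kwargs twin.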
Adding a format is one pure function (bytes, options) → Artifact — the contributor guide is a checklist, and the conformance suite picks your processor up automatically. Agents get the same power through the MCP server; everything else gets the HTTP API. Python today; the contract is designed so R, Julia, and TypeScript clients parse the same DSL and return the same Artifact.
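A processor in the (bytes, options) → Artifact style might look like the sketch below, which handles CSV with the standard library. Everything specific here is an assumption made for illustration: the function name `process_csv`, the plain dictionary standing in for an Artifact, the `rows` option counting the header row, and the `meta` contents. The contributor guide defines the real signature and registration steps.

```python
import csv
import io


def process_csv(data: bytes, options: dict) -> dict:
    """Pure function: the same bytes and options always give the same result."""
    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    rows = list(reader)
    limit = options.get("rows")
    if limit is not None:
        rows = rows[: int(limit)]  # assumption: the limit includes the header row
    text = "\n".join(" | ".join(row) for row in rows)
    # Artifact-shaped dictionary with the five top-level fields.
    return {
        "text": text,
        "images": [],
        "audio": [],
        "video": [],
        "meta": {"rows": len(rows)},
    }


artifact = process_csv(b"name,total\nA,1\nB,2\n", {"rows": 2})
print(artifact["text"])
```

Keeping the function free of file access and global state is what makes it easy for a conformance suite to run it against fixed inputs.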