April 21, 2026
daft.File: Lazy Metadata Filters

Filter millions of files by path, size, and content type before opening any of them. Cheap operations first, expensive operations on the survivors.

by Everett Kleven

Previously we introduced daft.File — a lazily evaluated file reference that treats unstructured data as a first-class type. This week: opening files and using metadata to control what gets opened.

The pattern

daft.File is lazy. When you call daft.from_files(), nothing downloads. You get lightweight references — millions of them if needed.

from_files() accepts standard glob patterns (docs):

  • * — matches any number of characters
  • ? — matches any single character
  • [...] — matches any single character in the brackets
  • ** — matches directories, recursively
daft.from_files("s3://bucket/docs/**/*.md")           # recursive
daft.from_files("s3://bucket/logs/2026-03-??.jsonl")  # single char wildcard
daft.from_files(["s3://bucket/a/*.pdf", "s3://bucket/b/*.pdf"])  # multiple patterns

The real work starts when a UDF calls .open() or .to_tempfile() inside distributed execution. But you don't want to open 2 million files if you only need 50,000 of them. That's where metadata filtering comes in: file_path(), file_size(), and guess_mime_type() let you narrow the set before any file gets opened. Cheap operations first, expensive operations on the survivors.

Opening files: markdown example

Parse every markdown file in a repository — extract headings into a structured DataFrame:

from collections.abc import Iterator
from typing import TypedDict
 
import daft
from daft import col
from daft.functions import unnest
 
 
class Heading(TypedDict):
    level: int
    text: str
 
 
@daft.func
def extract_headings(file: daft.File) -> Iterator[Heading]:
    with file.open() as f:
        content = f.read().decode("utf-8")
    for line in content.splitlines():
        if line.startswith("#"):
            yield Heading(
                level=len(line) - len(line.lstrip("#")),
                text=line.lstrip("# ").strip(),
            )
 
 
df = (
    daft.from_files("**/*.md")
    .with_column("heading", extract_headings(col("file")))
    .select(col("file"), unnest(col("heading")))
)
 
df.show(10)

Three things worth noting:

  • from_files("**/*.md") already returns a file column of type daft.File — no cast or separate setup step needed. The glob pattern handles filtering to .md files directly.
  • @daft.func with Iterator[Heading] is row-generating: each yield becomes a separate row as a struct. Use unnest to expand the struct fields (level, text) into columns. This is different from a UDF that returns a list — those use explode.
  • The engine handles distribution across partitions — same code works on 10 files or 10,000.

Filtering before you open

file_path() requires daft>=0.7.9.

Opening files is the expensive part. file_path() and guess_mime_type() let you filter at the reference level — no I/O, no egress.

By path

import daft
from daft import col
from daft.functions import file_path
 
path = file_path(col("file"))
 
# Only markdown files from the docs directory
df = (
    daft.from_files("s3://repo/**/*")
    .where(path.endswith(".md"))
    .where(path.contains("/docs/"))
)

By content type

Extension matching is fast but unreliable — files get renamed, extensions go missing. guess_mime_type() inspects the file's magic bytes instead:

import daft
from daft import col
from daft.functions import guess_mime_type
 
df = daft.from_files("s3://inbox/**/*")
df = df.with_column("mime", guess_mime_type(col("file")))
 
pdfs = df.where(col("mime") == "application/pdf")
images = df.where(col("mime").startswith("image/"))

Extract metadata from paths

import daft
from daft import col
from daft.functions import file_path

# Partition by date from filenames
df = (
    daft.from_files("s3://logs/events-*.jsonl")
    .with_column("filename", file_path(col("file")).regexp_extract(r"[^/]+$", 0))
    .with_column("date", col("filename").regexp_extract(r"(\d{4}-\d{2}-\d{2})", 0))
    .where(col("date") >= "2026-03-01")
)

What else can you open?

The same .open() and .to_tempfile() interface works for any file type. Run these yourself:

All examples: Eventual-Inc/daft-examples/examples/files

Get updates, contribute code, or say hi:

  • Daft Engineering Blog — join us as we explore innovative ways to handle multimodal datasets, optimize performance, and simplify your data workflows.
  • GitHub Discussions Forums
  • The Distributed Data Community Slack