April 21, 2026
daft.File: Lazy Metadata Filters

Filter millions of files by path, size, and content type before opening any of them. Cheap operations first, expensive operations on the survivors.

by Everett Kleven

Previously we introduced daft.File — a lazily evaluated file reference that treats unstructured data as a first-class type. This week: opening files and using metadata to control what gets opened.

The pattern

daft.File is lazy. When you call daft.from_files(), nothing downloads. You get lightweight references — millions of them if needed.

from_files() accepts standard glob patterns (docs):

  • * — matches any number of characters
  • ? — matches any single character
  • [...] — matches any single character in the brackets
  • ** — matches directories, recursively
daft.from_files("s3://bucket/docs/**/*.md")           # recursive
daft.from_files("s3://bucket/logs/2026-03-??.jsonl")  # single char wildcard
daft.from_files(["s3://bucket/a/*.pdf", "s3://bucket/b/*.pdf"])  # multiple patterns

The real work starts when a UDF calls .open() or .to_tempfile() inside distributed execution. But you don't want to open 2 million files if you only need 50,000 of them. That's where metadata filtering comes in: file_path(), file_size(), and guess_mime_type() let you narrow the set before any file gets opened. Cheap operations first, expensive operations on the survivors.

Opening files: markdown example

Parse every markdown file in a repository — extract headings into a structured DataFrame:

from collections.abc import Iterator
from typing import TypedDict
 
import daft
from daft import col
from daft.functions import unnest
 
 
class Heading(TypedDict):
    level: int
    text: str
 
 
@daft.func
def extract_headings(file: daft.File) -> Iterator[Heading]:
    with file.open() as f:
        content = f.read().decode("utf-8")
    for line in content.splitlines():
        if line.startswith("#"):
            yield Heading(
                level=len(line) - len(line.lstrip("#")),
                text=line.lstrip("# ").strip(),
            )
 
 
df = (
    daft.from_files("**/*.md")
    .with_column("heading", extract_headings(col("file")))
    .select(col("file"), unnest(col("heading")))
)
 
df.show(10)

Three things worth noting:

  • from_files("**/*.md") already returns a file column of type daft.File — no cast or separate setup step needed. The glob pattern handles filtering to .md files directly.
  • @daft.func with Iterator[Heading] is row-generating: each yield becomes a separate row as a struct. Use unnest to expand the struct fields (level, text) into columns. This is different from a UDF that returns a list — those use explode.
  • The engine handles distribution across partitions — same code works on 10 files or 10,000.

Filtering before you open

file_path() requires daft>=0.7.9.

Opening files is the expensive part. file_path() and guess_mime_type() let you filter at the reference level — no I/O, no egress.

By path

import daft
from daft import col
from daft.functions import file_path
 
path = file_path(col("file"))
 
# Only markdown files from the docs directory
df = (
    daft.from_files("s3://repo/**/*")
    .where(path.endswith(".md"))
    .where(path.contains("/docs/"))
)

By content type

Extension matching is fast but unreliable — files get renamed, extensions go missing. guess_mime_type() inspects the file's magic bytes instead:

import daft
from daft import col
from daft.functions import guess_mime_type
 
df = daft.from_files("s3://inbox/**/*")
df = df.with_column("mime", guess_mime_type(col("file")))
 
pdfs = df.where(col("mime") == "application/pdf")
images = df.where(col("mime").startswith("image/"))

Extract metadata from paths

import daft
from daft import col
from daft.functions import file_path

# Partition by date from filenames
df = (
    daft.from_files("s3://logs/events-*.jsonl")
    .with_column("filename", file_path(col("file")).regexp_extract(r"[^/]+$", 0))
    .with_column("date", col("filename").regexp_extract(r"(\d{4}-\d{2}-\d{2})", 0))
    .where(col("date") >= "2026-03-01")
)

What else can you open?

The same .open() and .to_tempfile() interface works for any file type. Run these yourself:

All examples: Eventual-Inc/daft-examples/examples/files

Get updates, contribute code, or say hi:

  • Daft Engineering Blog — join us as we explore innovative ways to handle multimodal datasets, optimize performance, and simplify your data workflows.
  • GitHub Discussions Forums
  • The Distributed Data Community Slack