pathlib Filesystem — Deep Dive

Inside Python's pathlib: PurePath vs Path hierarchy, __fspath__ protocol, performance characteristics, and production file handling patterns

The pathlib Class Hierarchy

pathlib has two layers:

PurePath (no I/O — pure path manipulation)
├── PurePosixPath
└── PureWindowsPath

Path(PurePath) (adds I/O operations)
├── PosixPath(Path, PurePosixPath)
└── WindowsPath(Path, PureWindowsPath)

PurePath handles string manipulation: joining, splitting, matching. Path adds filesystem operations: reading, writing, checking existence. This separation means you can manipulate Windows paths on Linux (for config parsing, URL building) without needing actual Windows filesystem access.

from pathlib import PureWindowsPath

# Parse Windows paths on Linux
p = PureWindowsPath(r"C:\Users\Alice\report.pdf")
print(p.name)    # 'report.pdf'
print(p.parent)  # PureWindowsPath('C:/Users/Alice')
print(p.drive)   # 'C:'

The fspath Protocol (PEP 519)

Many libraries still expect string paths. Python 3.6 introduced os.fspath() and the __fspath__ protocol so Path objects work seamlessly:

import os
from pathlib import Path

p = Path("/tmp/data.csv")
os.fspath(p)  # '/tmp/data.csv' — string for C libraries

# These all accept Path objects directly:
open(p)              # Built-in open
os.stat(p)           # os module
shutil.copy(p, dst)  # shutil

Under the hood, functions call os.fspath(arg) which checks for __fspath__() method. If you build custom path-like objects, implement this method to integrate with the ecosystem.

The / Operator Internals

The / operator for path joining is implemented via __truediv__ and __rtruediv__:

class PurePath:
    def __truediv__(self, key):
        return self._make_child((key,))

    def __rtruediv__(self, key):
        return type(self)(key, self)

This allows both directions:

Path("/home") / "alice"     # Path.__truediv__("alice")
"/home" / Path("alice")     # str doesn't know Path, so Python calls Path.__rtruediv__("/home")

Glob Implementation Details

Path.glob() uses fnmatch patterns, not regular expressions:

Pattern	Meaning
`*`	Match everything except `/`
`**`	Match zero or more directories (recursive)
`?`	Match any single character
`[abc]`	Match one of a, b, or c
`[!abc]`	Match anything except a, b, or c

Performance of glob vs os.scandir

Path.glob() internally uses os.scandir(), which is significantly faster than os.listdir() because it retrieves file type information alongside names (via readdir on Unix, FindNextFile on Windows):

from pathlib import Path
import time

# glob is lazy — it yields results as found
start = time.perf_counter()
for p in Path("/usr").rglob("*.py"):
    pass  # Process one at a time, no memory spike
elapsed = time.perf_counter() - start

For very large directories (100K+ files), consider os.scandir() directly if you need maximum control over buffering and filtering.

Walk pattern (Python 3.12+)

# New in 3.12: Path.walk() — replaces os.walk()
for dirpath, dirnames, filenames in Path("/project").walk():
    for name in filenames:
        if name.endswith(".py"):
            print(dirpath / name)

Atomic File Operations

pathlib’s write_text() and write_bytes() are not atomic — a crash during write can leave a corrupt file. For production use:

from pathlib import Path
import tempfile
import os

def atomic_write(path: Path, content: str, encoding: str = "utf-8"):
    """Write atomically using temp file + rename."""
    path = Path(path)
    fd, tmp_path = tempfile.mkstemp(
        dir=path.parent,
        prefix=f".{path.name}.",
        suffix=".tmp"
    )
    try:
        with os.fdopen(fd, 'w', encoding=encoding) as f:
            f.write(content)
        os.replace(tmp_path, path)  # Atomic on POSIX
    except:
        os.unlink(tmp_path)
        raise

os.replace() is atomic on POSIX filesystems (same filesystem). On Windows, it’s atomic for NTFS.

Handling Large Directory Trees

Efficient recursive file processing

from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def process_file(path: Path) -> dict:
    stat = path.stat()
    return {
        "path": str(path),
        "size": stat.st_size,
        "modified": stat.st_mtime,
    }

def scan_tree(root: Path, pattern: str = "*") -> list[dict]:
    files = list(root.rglob(pattern))

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(process_file, files))

    return results

Directory size calculation

def dir_size(path: Path) -> int:
    """Total size of all files in directory tree."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

stat() and File Metadata

from pathlib import Path
import datetime

p = Path("important.log")
s = p.stat()

s.st_size       # File size in bytes
s.st_mtime      # Modification time (Unix timestamp)
s.st_ctime      # Creation time (Windows) / metadata change (Unix)
s.st_mode       # File permissions (octal)
s.st_uid        # Owner user ID (Unix)
s.st_nlink      # Number of hard links

# Human-readable modification time
modified = datetime.datetime.fromtimestamp(s.st_mtime)

# lstat() for symlinks — doesn't follow the link
if p.is_symlink():
    link_stat = p.lstat()  # Stats of the link itself
    target = p.resolve()    # Where the link points

Path Comparison and Hashing

Paths are compared case-sensitively on POSIX and case-insensitively on Windows:

# On Linux:
Path("File.txt") == Path("file.txt")  # False

# On Windows:
WindowsPath("File.txt") == WindowsPath("file.txt")  # True

Paths are hashable and can be used as dictionary keys or set members:

seen = set()
for p in root.rglob("*"):
    resolved = p.resolve()
    if resolved not in seen:
        seen.add(resolved)
        process(p)

Using resolve() before adding to a set normalizes symlinks and .. components.

Production Pattern: Configuration File Finder

from pathlib import Path
from typing import Optional

def find_config(
    name: str = ".myapp.toml",
    start: Optional[Path] = None
) -> Optional[Path]:
    """Search up the directory tree for a config file."""
    current = (start or Path.cwd()).resolve()

    while True:
        candidate = current / name
        if candidate.is_file():
            return candidate
        parent = current.parent
        if parent == current:  # Root directory
            return None
        current = parent

Production Pattern: Safe Temporary Files

from pathlib import Path
import tempfile
from contextlib import contextmanager

@contextmanager
def temp_directory(prefix: str = "myapp_"):
    """Create a temporary directory that cleans itself up."""
    tmp = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield tmp
    finally:
        import shutil
        shutil.rmtree(tmp, ignore_errors=True)

with temp_directory() as tmp:
    data_file = tmp / "processed.json"
    data_file.write_text('{"result": 42}')
    # Directory and contents deleted on exit

Migrating from os.path

A systematic approach:

# Before
import os

def old_style():
    base = os.path.dirname(os.path.abspath(__file__))
    config = os.path.join(base, "config", "settings.ini")
    if os.path.exists(config):
        with open(config) as f:
            return f.read()
    return None

# After
from pathlib import Path

def new_style():
    base = Path(__file__).resolve().parent
    config = base / "config" / "settings.ini"
    if config.exists():
        return config.read_text()
    return None

The pathlib version is 30% shorter and reads more naturally. Most standard library and third-party libraries now accept Path objects.

One thing to remember: pathlib provides an object-oriented filesystem API built on two layers — PurePath for string manipulation and Path for I/O operations. The __fspath__ protocol ensures compatibility with the entire Python ecosystem. For production code: use resolve() to normalize paths, atomic writes for data integrity, and rglob() for recursive traversal. pathlib is now Python’s recommended path handling approach — os.path is legacy.

pythonstandard-libraryfilesystem