pathlib Filesystem — Deep Dive
The pathlib Class Hierarchy
pathlib has two layers:
PurePath (no I/O — pure path manipulation)
├── PurePosixPath
└── PureWindowsPath
Path(PurePath) (adds I/O operations)
├── PosixPath(Path, PurePosixPath)
└── WindowsPath(Path, PureWindowsPath)
PurePath handles string manipulation: joining, splitting, matching. Path adds filesystem operations: reading, writing, checking existence. This separation means you can manipulate Windows paths on Linux (for config parsing, URL building) without needing actual Windows filesystem access.
from pathlib import PureWindowsPath
# Parse Windows paths on Linux
p = PureWindowsPath(r"C:\Users\Alice\report.pdf")
print(p.name)    # report.pdf
print(p.parent)  # C:\Users\Alice
print(p.drive)   # C:
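To make the flavour separation concrete, here is a small sketch that parses the same raw string with both pure classes. The variable names are illustrative only; nothing here touches the filesystem, so it should behave the same on any operating system.

```python
from pathlib import PurePosixPath, PureWindowsPath

raw = r"C:\Users\Alice\report.pdf"

as_windows = PureWindowsPath(raw)
as_posix = PurePosixPath(raw)

# Windows flavour: drive, root, and backslash separators are understood.
windows_parts = as_windows.parts

# POSIX flavour: a backslash is an ordinary character, so this is one component.
posix_parts = as_posix.parts

# Pure manipulation works without the file existing anywhere.
renamed = as_windows.with_suffix(".txt")
forward = as_windows.as_posix()

print(windows_parts)
print(posix_parts)
print(renamed)
print(forward)
```

Choosing the flavour explicitly is what lets a Linux service interpret paths that came from a Windows client.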
The fspath Protocol (PEP 519)
Many libraries still expect string paths. Python 3.6 introduced os.fspath() and the __fspath__ protocol so Path objects work seamlessly:
import os
import shutil
from pathlib import Path
p = Path("/tmp/data.csv")
dst = Path("/tmp/backup.csv")  # hypothetical destination, for illustration
os.fspath(p)  # '/tmp/data.csv' (a plain str, suitable for C libraries)
# These all accept Path objects directly:
open(p)              # Built-in open
os.stat(p)           # os module
shutil.copy(p, dst)  # shutil
Under the hood, these functions call os.fspath(arg), which returns str and bytes arguments unchanged and otherwise calls the object's __fspath__() method. If you build custom path-like objects, implement this method to integrate with the ecosystem.
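As a minimal sketch of a custom path-like object, the class below implements __fspath__ and is then passed to open() and Path(). DatasetRef and its label attribute are hypothetical names invented for this example, not part of any library.

```python
import os
import tempfile
from pathlib import Path


class DatasetRef:
    """Hypothetical wrapper that carries a path plus some metadata."""

    def __init__(self, location: str, label: str) -> None:
        self.location = location
        self.label = label

    def __fspath__(self) -> str:
        # Must return str or bytes; os.fspath() raises TypeError otherwise.
        return self.location


with tempfile.TemporaryDirectory() as tmp:
    ref = DatasetRef(os.path.join(tmp, "data.csv"), label="demo")

    # open() accepts any object that implements __fspath__.
    with open(ref, "w", encoding="utf-8") as f:
        f.write("a,b\n1,2\n")

    # Path() accepts it too, which gives access to the full pathlib API.
    as_path = Path(ref)
    file_name = as_path.name
    first_line = as_path.read_text(encoding="utf-8").splitlines()[0]
    raw = os.fspath(ref)

print(file_name, first_line)
```

Because os.PathLike checks only for the presence of __fspath__, isinstance(ref, os.PathLike) is true without any explicit inheritance.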
The / Operator Internals
The / operator for path joining is implemented via __truediv__ and __rtruediv__:
# Simplified sketch; the exact internals differ between CPython versions
class PurePath:
    def __truediv__(self, key):
        return self._make_child((key,))
    def __rtruediv__(self, key):
        return type(self)(key, self)
This allows both directions:
Path("/home") / "alice" # Path.__truediv__("alice")
"/home" / Path("alice") # str doesn't know Path, so Python calls Path.__rtruediv__("/home")
Glob Implementation Details
Path.glob() uses fnmatch patterns, not regular expressions:
| Pattern | Meaning |
|---|---|
| `*` | Match everything except / |
| `**` | Match zero or more directories (recursive) |
| `?` | Match any single character |
| `[abc]` | Match one of a, b, or c |
| `[!abc]` | Match any single character except a, b, or c |
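The patterns in the table can be tried against a throwaway directory. This is a sketch with made-up file names; the names() helper is introduced here only to make the results easy to compare.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg" / "sub").mkdir(parents=True)
    for rel in ("a.py", "b.txt", "pkg/c.py", "pkg/sub/d.py", "pkg/sub/e1.md"):
        (root / rel).write_text("", encoding="utf-8")

    def names(pattern: str) -> list[str]:
        """Return matches as sorted, root-relative POSIX-style strings."""
        return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))

    top_level_py = names("*.py")    # * stays inside one directory level
    all_py = names("**/*.py")       # ** also matches zero directories
    single_char = names("?.py")     # exactly one character before .py
    char_class = names("[ab].*")    # first character is a or b
    negated = names("[!a].*")       # first character is anything but a

print(top_level_py, all_py, single_char, char_class, negated)
```

Sorting the results matters for comparisons because glob() makes no promise about the order in which entries are yielded.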
Performance of glob vs os.scandir
Path.glob() internally uses os.scandir(), which is significantly faster than os.listdir() because it retrieves file type information alongside names (via readdir on Unix, FindNextFile on Windows):
from pathlib import Path
import time
# glob is lazy — it yields results as found
start = time.perf_counter()
for p in Path("/usr").rglob("*.py"):
    pass  # Process one at a time, no memory spike
elapsed = time.perf_counter() - start
For very large directories (100K+ files), consider os.scandir() directly if you need maximum control over buffering and filtering.
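For the case where direct control is wanted, one possible shape is an explicit stack over os.scandir(). The iter_files helper below is a hypothetical name for this sketch, and it chooses not to follow symlinks; other policies are equally valid.

```python
import os
import tempfile
from collections.abc import Iterator


def iter_files(root: str, suffix: str) -> Iterator[os.DirEntry]:
    """Yield DirEntry objects for regular files under root ending in suffix."""
    stack = [root]
    while stack:
        current = stack.pop()
        # The context manager closes the directory handle promptly.
        with os.scandir(current) as entries:
            for entry in entries:
                # is_dir()/is_file() can usually answer from cached type info.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False) and entry.name.endswith(suffix):
                    yield entry


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b"))
    for rel in ("x.py", "a/y.py", "a/b/z.py", "a/b/readme.md"):
        with open(os.path.join(tmp, *rel.split("/")), "w", encoding="utf-8"):
            pass

    found = sorted(entry.name for entry in iter_files(tmp, ".py"))

print(found)
```

Working with DirEntry objects avoids building a Path for every entry, which is where most of the saving over rglob() would come from on very large trees.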
Walk pattern (Python 3.12+)
from pathlib import Path
# New in 3.12: Path.walk(), the pathlib counterpart to os.walk()
for dirpath, dirnames, filenames in Path("/project").walk():
    for name in filenames:
        if name.endswith(".py"):
            print(dirpath / name)
Atomic File Operations
pathlib’s write_text() and write_bytes() are not atomic — a crash during write can leave a corrupt file. For production use:
from pathlib import Path
import tempfile
import os
def atomic_write(path: Path, content: str, encoding: str = "utf-8"):
    """Write atomically using temp file + rename."""
    path = Path(path)
    # Create the temp file next to the target so both are on the same filesystem
    fd, tmp_path = tempfile.mkstemp(
        dir=path.parent,
        prefix=f".{path.name}.",
        suffix=".tmp"
    )
    try:
        with os.fdopen(fd, 'w', encoding=encoding) as f:
            f.write(content)
        os.replace(tmp_path, path)  # Atomic on POSIX
    except BaseException:
        # Remove the temp file on any failure, then re-raise
        os.unlink(tmp_path)
        raise
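os.replace() is atomic on POSIX when the source and destination are on the same filesystem, which is why the temp file is created in the target's directory. On Windows it overwrites an existing destination, but Python's documentation states the atomicity guarantee only as a POSIX requirement, so treat it as best effort there. Note also that atomic is not the same as durable: without an fsync, a power loss shortly after the rename can still lose the new contents.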
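A possible variation on atomic_write above, sketched here under the assumption that you also want the data flushed to disk before the rename. The name durable_atomic_write is hypothetical, and the sketch deliberately skips syncing the parent directory, which some filesystems need for full durability.

```python
import os
import tempfile
from pathlib import Path


def durable_atomic_write(path: Path, data: bytes) -> None:
    """Write bytes to a temp file, fsync it, then rename it over the target."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=f".{path.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to push it to the device
        os.replace(tmp_name, path)
    except BaseException:
        # Best-effort cleanup; the temp file may already be gone.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "state.json"
    durable_atomic_write(target, b'{"v": 1}')
    durable_atomic_write(target, b'{"v": 2}')  # overwrites in one step
    final = target.read_bytes()
    leftovers = [p.name for p in Path(tmp).iterdir() if p.name != "state.json"]

print(final, leftovers)
```

Readers of the target file see either the old contents or the new contents, never a partially written file.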
Handling Large Directory Trees
Efficient recursive file processing
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
def process_file(path: Path) -> dict:
    stat = path.stat()
    return {
        "path": str(path),
        "size": stat.st_size,
        "modified": stat.st_mtime,
    }
def scan_tree(root: Path, pattern: str = "*") -> list[dict]:
    files = list(root.rglob(pattern))
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(process_file, files))
    return results
Directory size calculation
def dir_size(path: Path) -> int:
    """Total size of all files in directory tree."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())
stat() and File Metadata
from pathlib import Path
import datetime
p = Path("important.log")
s = p.stat()
s.st_size # File size in bytes
s.st_mtime # Modification time (Unix timestamp)
s.st_ctime # Metadata change time (Unix) / creation time (Windows)
s.st_mode # File type and permission bits (an int; decode with the stat module)
s.st_uid # Owner user ID (Unix)
s.st_nlink # Number of hard links
# Human-readable modification time
modified = datetime.datetime.fromtimestamp(s.st_mtime)
# lstat() for symlinks — doesn't follow the link
if p.is_symlink():
    link_stat = p.lstat()  # Stats of the link itself
    target = p.resolve()   # Final target, after following every link in the chain
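Since st_mode packs the file type and the permission bits into one integer, the standard stat module is the usual way to take it apart. The sketch below uses a temporary file; the permission values in the comments are typical examples, not guaranteed results.

```python
import stat
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    f = Path(tmp) / "important.log"
    f.write_bytes(b"hello\n")

    mode = f.stat().st_mode
    is_regular = stat.S_ISREG(mode)
    is_directory = stat.S_ISDIR(Path(tmp).stat().st_mode)
    permission_bits = stat.S_IMODE(mode)  # e.g. 0o644 on many Unix systems
    listing_style = stat.filemode(mode)   # e.g. '-rw-r--r--'
    size = f.stat().st_size

print(oct(permission_bits), listing_style, size)
```

Use oct() when printing permission bits; the raw integer is hard to read in decimal.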
Path Comparison and Hashing
Paths are compared case-sensitively on POSIX and case-insensitively on Windows:
# On Linux:
Path("File.txt") == Path("file.txt") # False
# On Windows:
WindowsPath("File.txt") == WindowsPath("file.txt") # True
Paths are hashable and can be used as dictionary keys or set members:
# root is the directory to scan; process() stands in for your own handler
seen = set()
for p in root.rglob("*"):
    resolved = p.resolve()
    if resolved not in seen:
        seen.add(resolved)
        process(p)
Using resolve() before adding to a set normalizes symlinks and .. components.
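The comparison rules can be checked on any operating system by using the pure classes, which is what this short sketch does. The file names are arbitrary.

```python
from pathlib import PurePosixPath, PureWindowsPath

posix_equal = PurePosixPath("File.txt") == PurePosixPath("file.txt")
windows_equal = PureWindowsPath("File.txt") == PureWindowsPath("file.txt")

# Equal paths hash equally, so a set collapses case variants on Windows flavour.
unique_windows = {PureWindowsPath("C:/Data/A.TXT"), PureWindowsPath("c:/data/a.txt")}
unique_posix = {PurePosixPath("/data/A.TXT"), PurePosixPath("/data/a.txt")}

# Paths of different flavours never compare equal, even with identical text.
cross_flavour = PurePosixPath("a/b") == PureWindowsPath("a/b")

print(posix_equal, windows_equal, len(unique_windows), len(unique_posix), cross_flavour)
```

Note that this comparison is purely textual: it reflects the path flavour, not whether the underlying filesystem happens to be case-sensitive.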
Production Pattern: Configuration File Finder
from pathlib import Path
from typing import Optional
def find_config(
    name: str = ".myapp.toml",
    start: Optional[Path] = None
) -> Optional[Path]:
    """Search up the directory tree for a config file."""
    current = (start or Path.cwd()).resolve()
    while True:
        candidate = current / name
        if candidate.is_file():
            return candidate
        parent = current.parent
        if parent == current:  # Root directory
            return None
        current = parent
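The same upward search can also be written with Path.parents, which already enumerates every ancestor up to the root. This is an alternative sketch, not a replacement for the function above; find_upwards is a hypothetical name, and the second lookup assumes no ancestor of the temp directory contains a file with that unlikely name.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_upwards(name: str, start: Path) -> Optional[Path]:
    """Return the nearest file called name in start or any of its ancestors."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    nested = root / "src" / "pkg"
    nested.mkdir(parents=True)
    (root / ".myapp.toml").write_text("", encoding="utf-8")

    found = find_upwards(".myapp.toml", nested)
    missing = find_upwards(".surely-not-present-1f3a.toml", nested)

print(found, missing)
```

The loop terminates on its own because parents is a finite sequence, so no explicit root check is needed.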
Production Pattern: Safe Temporary Files
from pathlib import Path
import shutil
import tempfile
from contextlib import contextmanager
@contextmanager
def temp_directory(prefix: str = "myapp_"):
    """Create a temporary directory that cleans itself up."""
    tmp = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield tmp
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
with temp_directory() as tmp:
    data_file = tmp / "processed.json"
    data_file.write_text('{"result": 42}')
# Directory and contents deleted on exit
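If no custom cleanup policy is needed, the standard library's tempfile.TemporaryDirectory covers the same ground as the helper above. It yields a str rather than a Path, so the sketch wraps it once on entry.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory(prefix="myapp_") as tmp_name:
    tmp = Path(tmp_name)  # the context manager hands back a plain string
    data_file = tmp / "processed.json"
    data_file.write_text('{"result": 42}', encoding="utf-8")
    existed_inside = data_file.exists()

# The directory and its contents are removed when the block exits.
exists_after = tmp.exists()

print(existed_inside, exists_after)
```

The hand-written context manager remains useful when cleanup errors should be ignored or logged in a specific way.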
Migrating from os.path
A systematic approach:
# Before
import os
def old_style():
    base = os.path.dirname(os.path.abspath(__file__))
    config = os.path.join(base, "config", "settings.ini")
    if os.path.exists(config):
        with open(config) as f:
            return f.read()
    return None
# After
from pathlib import Path
def new_style():
    base = Path(__file__).resolve().parent
    config = base / "config" / "settings.ini"
    if config.exists():
        return config.read_text()
    return None
The pathlib version is shorter and reads more naturally. Most standard library functions and many third-party libraries now accept Path objects.
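When migrating function by function, it can help to confirm that each os.path call and its pathlib counterpart agree. The sketch below uses posixpath and PurePosixPath so the results do not depend on the host platform; the sample path is made up.

```python
import posixpath
from pathlib import PurePosixPath

raw = "/srv/app/config/settings.tar.gz"
p = PurePosixPath(raw)

# Each entry pairs the os.path-style result with the pathlib result.
pairs = {
    "dirname": (posixpath.dirname(raw), str(p.parent)),
    "basename": (posixpath.basename(raw), p.name),
    "splitext": (posixpath.splitext(raw)[1], p.suffix),
    "join": (
        posixpath.join("/srv", "app", "x.ini"),
        str(PurePosixPath("/srv") / "app" / "x.ini"),
    ),
    "isabs": (posixpath.isabs(raw), p.is_absolute()),
}

for name, (old, new) in pairs.items():
    print(f"{name}: {old!r} vs {new!r}")

# pathlib also exposes pieces that need extra work with os.path.
all_suffixes = p.suffixes
stem = p.stem
```

A table like this is also a convenient checklist for code review during the migration.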
Key takeaway: pathlib provides an object-oriented filesystem API built on two layers, PurePath for string manipulation and Path for I/O operations. The __fspath__ protocol keeps Path objects compatible with any API that accepts path-like arguments. For production code, use resolve() to normalize paths, atomic writes for data integrity, and rglob() for recursive traversal. pathlib is the idiomatic choice for new code; os.path remains supported, and its functions accept Path objects too.