LLM Function Calling in Python — Deep Dive

Function calling is the mechanism that turns LLMs from text generators into orchestrators of real-world actions. Getting it right in production requires careful schema design, safe dispatch, and handling the differences between providers.

1) Schema design principles

Tool schemas are your API contract with the model. They determine call accuracy more than any prompt engineering.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the current status of a customer order by order ID. Returns status, estimated delivery date, and tracking URL.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order identifier, e.g. 'ORD-12345'",
                        "pattern": "^ORD-\\d+$"
                    }
                },
                "required": ["order_id"],
                "additionalProperties": False
            }
        }
    }
]

Key practices:

  • Use additionalProperties: false to prevent hallucinated parameters.
  • Add pattern or enum constraints where possible — models respect them.
  • Write descriptions that explain both the function’s purpose and its return value.
  • Keep parameter counts low (under 5). Complex inputs reduce accuracy.

2) Dispatch architecture

Build a registry that maps function names to callables with validation:

from pydantic import BaseModel, ValidationError
from typing import Callable, Any

class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, tuple[Callable, type[BaseModel] | None]] = {}

    def register(self, name: str, fn: Callable, schema: type[BaseModel] | None = None):
        self._tools[name] = (fn, schema)

    def execute(self, name: str, arguments: dict[str, Any]) -> str:
        if name not in self._tools:
            return f"Error: unknown tool '{name}'"
        fn, schema = self._tools[name]
        if schema:
            try:
                validated = schema.model_validate(arguments)
                arguments = validated.model_dump()
            except ValidationError as e:
                return f"Validation error: {e}"
        try:
            result = fn(**arguments)
            return str(result)
        except Exception as e:
            return f"Execution error: {e}"

registry = ToolRegistry()

Always validate arguments before execution. The model can hallucinate parameter values. Never pass raw model output to security-sensitive functions (database queries, file operations) without validation.

3) The agentic loop

A complete function-calling loop handles multiple rounds:

from openai import OpenAI

client = OpenAI()

def run_agent(messages: list[dict], tools: list[dict], max_rounds: int = 10) -> str:
    for _ in range(max_rounds):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )
        msg = response.choices[0].message
        messages.append(msg.model_dump())

        if not msg.tool_calls:
            return msg.content

        for tc in msg.tool_calls:
            result = registry.execute(tc.function.name, tc.function.arguments_dict)
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result,
            })
    return "Max rounds reached without final answer"

The max_rounds cap is essential. Without it, a model that keeps requesting tools can loop indefinitely, burning tokens and budget.

4) Provider differences

OpenAI uses tools parameter with tool_calls in responses. Supports parallel tool calls natively.

Anthropic uses tools in the API but returns tool use as content blocks with type: "tool_use". Results go back as tool_result content blocks. Supports parallel calls.

Google (Gemini) uses function_declarations and returns function_call parts. Different JSON structure but same conceptual loop.

For multi-provider support, abstract the tool-call extraction and result formatting behind a provider interface:

class ToolCallExtractor:
    def extract(self, response) -> list[tuple[str, str, dict]]:
        """Returns list of (call_id, function_name, arguments)"""
        raise NotImplementedError

class OpenAIExtractor(ToolCallExtractor):
    def extract(self, response):
        msg = response.choices[0].message
        if not msg.tool_calls:
            return []
        return [(tc.id, tc.function.name, tc.function.arguments_dict)
                for tc in msg.tool_calls]

5) Streaming tool calls

For real-time UIs, you need to handle tool calls that arrive in chunks:

def stream_with_tools(messages, tools):
    tool_calls_buffer = {}
    stream = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools, stream=True
    )
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.tool_calls:
            for tc_delta in delta.tool_calls:
                idx = tc_delta.index
                if idx not in tool_calls_buffer:
                    tool_calls_buffer[idx] = {"id": "", "name": "", "args": ""}
                if tc_delta.id:
                    tool_calls_buffer[idx]["id"] = tc_delta.id
                if tc_delta.function:
                    if tc_delta.function.name:
                        tool_calls_buffer[idx]["name"] = tc_delta.function.name
                    if tc_delta.function.arguments:
                        tool_calls_buffer[idx]["args"] += tc_delta.function.arguments
        elif delta.content:
            yield delta.content  # stream text to user
    # after stream ends, execute buffered tool calls
    for tc in tool_calls_buffer.values():
        yield registry.execute(tc["name"], json.loads(tc["args"]))

6) Security considerations

Function calling introduces a code-execution surface. Mitigations:

  • Allowlist functions — only register tools the model should access. Never expose arbitrary code execution.
  • Validate all arguments — Pydantic models catch type mismatches and constraint violations.
  • Rate-limit tool execution — prevent the model from calling expensive operations in a loop.
  • Sandbox dangerous operations — database writes, file system access, and network calls should go through permission-checked wrappers.
  • Log everything — record every tool call with arguments and results for audit trails.

7) Testing strategies

Unit-test each tool function independently. For integration tests, record model responses with VCR.py and replay them. This avoids API costs and makes tests deterministic.

Test edge cases: what happens when the model calls a tool that does not exist? When arguments fail validation? When a tool raises an exception? Your dispatch layer should handle all of these gracefully.

8) Performance optimization

  • Minimize tool count — models slow down and make worse choices with more than 10-15 tools. Group related functions or use a two-stage approach where one call selects a category, then a second call gets the specific tools.
  • Cache tool results — if the same query appears twice in a conversation, return the cached result.
  • Use cheaper models for routing — a fast model picks which tool to call, then a stronger model interprets the result.

The one thing to remember: Function calling is a structured interface between LLMs and your code — invest in schema quality, argument validation, and dispatch safety because the model’s suggestions flow directly into your execution layer.

pythonllm-appsfunction-callingopenaianthropic

See Also

  • Python Agent Frameworks An agent framework gives AI the ability to plan, use tools, and work through problems step by step — like upgrading a calculator into a research assistant.
  • Python Embedding Pipelines An embedding pipeline turns words into numbers that capture meaning — like translating every sentence into coordinates on a giant map of ideas.
  • Python Guardrails Ai Guardrails are safety bumpers for AI — they check what the model says before it reaches users, like a spellchecker but for facts, tone, and dangerous content.
  • Python Llm Evaluation Harness An LLM evaluation harness is like a report card for AI — it runs tests and grades how well the model answers questions so you know if it is actually improving.
  • Python Prompt Chaining Think of prompt chaining as a relay race where each runner hands a baton to the next — except the runners are AI prompts building on each other's work.