Strategies API Reference

LLMCallStrategy

async_batch_llm.LLMCallStrategy

Bases: ABC, Generic[TOutput]

Abstract base class for LLM call strategies.

A strategy encapsulates how LLM calls are made, including:

- Resource initialization (caches, clients)
- Call execution with retries
- Resource cleanup

The framework calls:

1. prepare() once before any retries
2. execute() for each attempt (including retries)
3. cleanup() once after all attempts complete or fail
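To make the lifecycle concrete, here is a minimal sketch of a strategy driven through those three phases. The `EchoStrategy` class and the `TokenUsage` TypedDict are local stand-ins written for this illustration, not imports from `async_batch_llm`; only the method names and call order mirror the interface documented here.

```python
import asyncio
from typing import TypedDict


class TokenUsage(TypedDict, total=False):
    # Stand-in for async_batch_llm's TokenUsage dict.
    input_tokens: int
    output_tokens: int
    total_tokens: int


class EchoStrategy:
    """Hypothetical strategy illustrating the prepare/execute/cleanup lifecycle."""

    async def prepare(self) -> None:
        # Called once per work item, before any attempts.
        self.calls = 0

    async def execute(
        self, prompt: str, attempt: int, timeout: float
    ) -> tuple[str, TokenUsage]:
        # Called for each attempt (including retries).
        self.calls += 1
        return f"echo: {prompt}", {"input_tokens": 1, "output_tokens": 1, "total_tokens": 2}

    async def cleanup(self) -> None:
        # Called once after all attempts, on success or failure.
        pass


async def main() -> tuple[str, TokenUsage]:
    s = EchoStrategy()
    await s.prepare()  # 1. prepare() once
    try:
        out, usage = await s.execute("hi", attempt=1, timeout=30.0)  # 2. execute() per attempt
    finally:
        await s.cleanup()  # 3. cleanup() once, even on failure
    return out, usage


out, usage = asyncio.run(main())
```

In the real framework this loop (including retries between `prepare()` and `cleanup()`) is run for you; a strategy only implements the three methods.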

cleanup async

cleanup() -> None

Clean up resources after all retry attempts complete.

Called once per work item after processing finishes (success or failure).

Use this for:

- Closing connections/sessions
- Releasing locks
- Logging final metrics
- Deleting temporary files

Do NOT use this for:

- Deleting caches intended for reuse across runs
- Destructive cleanup that prevents resource reuse

Note on Caches (v0.2.0): For reusable resources like Gemini caches with TTLs, consider letting them expire naturally to enable cost savings across multiple pipeline runs. See GeminiCachedStrategy for an example.

Default: no-op

Source code in src/async_batch_llm/llm_strategies.py
async def cleanup(self) -> None:
    """
    Clean up resources after all retry attempts complete.

    Called once per work item after processing finishes (success or failure).

    **Use this for:**
    - Closing connections/sessions
    - Releasing locks
    - Logging final metrics
    - Deleting temporary files

    **Do NOT use this for:**
    - Deleting caches intended for reuse across runs
    - Destructive cleanup that prevents resource reuse

    **Note on Caches (v0.2.0):**
    For reusable resources like Gemini caches with TTLs, consider letting
    them expire naturally to enable cost savings across multiple pipeline
    runs. See `GeminiCachedStrategy` for an example.

    Default: no-op
    """
    pass

dry_run async

dry_run(prompt: str) -> tuple[TOutput, TokenUsage]

Return mock output for dry-run mode (testing without API calls).

Override this method to provide realistic mock data for testing. Default implementation returns placeholder values that may not match your output type.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | `str` | The prompt that would have been sent to the LLM | *required* |

Returns:

| Type | Description |
| --- | --- |
| `tuple[TOutput, TokenUsage]` | Tuple of (mock_output, mock_token_usage) |

Default behavior:

- Returns the string "[DRY-RUN] Mock output" as output
- Returns mock token usage: 100 input, 50 output, 150 total

Source code in src/async_batch_llm/llm_strategies.py
async def dry_run(self, prompt: str) -> tuple[TOutput, TokenUsage]:
    """
    Return mock output for dry-run mode (testing without API calls).

    Override this method to provide realistic mock data for testing.
    Default implementation returns placeholder values that may not match
    your output type.

    Args:
        prompt: The prompt that would have been sent to the LLM

    Returns:
        Tuple of (mock_output, mock_token_usage)

    Default behavior:
    - Returns string "[DRY-RUN] Mock output" as output
    - Returns mock token usage: 100 input, 50 output, 150 total
    """
    mock_output: TOutput = f"[DRY-RUN] Mock output for prompt: {prompt[:50]}..."  # ty:ignore[invalid-assignment]
    mock_tokens: TokenUsage = {
        "input_tokens": 100,
        "output_tokens": 50,
        "total_tokens": 150,
    }
    return mock_output, mock_tokens
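Since the default placeholder string may not match your output type, a typed override is usually worthwhile. The sketch below is a hypothetical, self-contained example (the `SentimentStrategy` name and its dict output shape are invented for illustration) of a `dry_run` that returns a mock matching the strategy's real output type:

```python
import asyncio


class SentimentStrategy:
    """Hypothetical strategy whose real output type is a dict, not a string."""

    async def dry_run(self, prompt: str):
        # Return a mock that matches the real output type, so downstream
        # code can be exercised without making API calls.
        mock_output = {"label": "positive", "score": 0.99}
        mock_tokens = {
            "input_tokens": len(prompt.split()),  # rough word-count estimate
            "output_tokens": 10,
            "total_tokens": len(prompt.split()) + 10,
        }
        return mock_output, mock_tokens


output, tokens = asyncio.run(
    SentimentStrategy().dry_run("great product, would buy again")
)
```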

execute abstractmethod async

execute(prompt: str, attempt: int, timeout: float, state: RetryState | None = None) -> tuple[TOutput, TokenUsage]

Execute an LLM call for the given attempt.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | `str` | The prompt to send to the LLM | *required* |
| `attempt` | `int` | Which retry attempt this is (1, 2, 3, ...) | *required* |
| `timeout` | `float` | Maximum time to wait for response (seconds) | *required* |
| `state` | `RetryState \| None` | Optional retry state that persists across attempts (v0.3.0) | `None` |

Returns:

| Type | Description |
| --- | --- |
| `tuple[TOutput, TokenUsage]` | Tuple of (output, token_usage), where token_usage is a TokenUsage dict with optional keys: input_tokens, output_tokens, total_tokens, cached_input_tokens |

Note (v0.3.0): The state parameter allows strategies to maintain state across retry attempts for multi-stage retry patterns. See RetryState documentation for examples.

Source code in src/async_batch_llm/llm_strategies.py
@abstractmethod
async def execute(
    self, prompt: str, attempt: int, timeout: float, state: "RetryState | None" = None
) -> tuple[TOutput, TokenUsage]:
    """
    Execute an LLM call for the given attempt.

    Args:
        prompt: The prompt to send to the LLM
        attempt: Which retry attempt this is (1, 2, 3, ...)
        timeout: Maximum time to wait for response (seconds)
        state: Optional retry state that persists across attempts (v0.3.0)

    Returns:
        Tuple of (output, token_usage)
        where token_usage is a TokenUsage dict with optional keys:
        input_tokens, output_tokens, total_tokens, cached_input_tokens

    Raises:
        Any exception to trigger retry (if retryable) or failure

    Note (v0.3.0):
        The state parameter allows strategies to maintain state across retry
        attempts for multi-stage retry patterns. See RetryState documentation
        for examples.
    """
    pass
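The multi-stage retry pattern mentioned in the note can be sketched as follows. Everything here is a self-contained illustration: the `RetryState` class is a minimal stand-in for the framework's key-value store (mimicking the `get`/`set` interface used in the docs), `EscalatingStrategy` is hypothetical, and the two-attempt loop stands in for the framework's own retry machinery.

```python
import asyncio


class RetryState:
    """Minimal stand-in for the framework's RetryState (get/set key-value store)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class EscalatingStrategy:
    """Hypothetical strategy that tightens the prompt after a validation failure."""

    async def execute(self, prompt, attempt, timeout, state=None):
        # On later attempts, consult state written by on_error to adjust behavior.
        if state and state.get("validation_failures", 0) > 0:
            prompt = "Respond with valid JSON only.\n" + prompt
        if attempt == 1:
            raise ValueError("invalid JSON")  # simulate a failed first attempt
        return prompt, {"input_tokens": 10, "output_tokens": 5, "total_tokens": 15}

    async def on_error(self, exception, attempt, state=None):
        if state and isinstance(exception, ValueError):
            state.set("validation_failures", state.get("validation_failures", 0) + 1)


async def main():
    s, state = EscalatingStrategy(), RetryState()
    for attempt in (1, 2):  # simplified retry loop; the framework does this for you
        try:
            return await s.execute("Summarize the report.", attempt, 30.0, state)
        except ValueError as e:
            await s.on_error(e, attempt, state)


output, tokens = asyncio.run(main())
```

Because `state` outlives individual attempts, the second attempt can see what the first attempt learned.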

on_error async

on_error(exception: Exception, attempt: int, state: RetryState | None = None) -> None

Handle errors that occur during execute().

Called by the framework when execute() raises an exception, before deciding whether to retry. This allows strategies to:

- Inspect the error type to adjust retry behavior
- Store error information for use in next attempt
- Modify prompts based on validation errors
- Track error patterns across attempts

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `exception` | `Exception` | The exception that was raised during execute() | *required* |
| `attempt` | `int` | Which attempt number failed (1, 2, 3, ...) | *required* |
| `state` | `RetryState \| None` | Optional retry state that persists across attempts (v0.3.0) | `None` |

Default: no-op

Example (v0.2.0):

async def on_error(self, exception: Exception, attempt: int) -> None:
    # Store last error for smart retry logic
    self.last_error = exception

    # Track validation errors vs network errors
    if isinstance(exception, ValidationError):
        self.should_escalate_model = True

Example (v0.3.0 with retry state):

async def on_error(
    self, exception: Exception, attempt: int, state: RetryState | None = None
) -> None:
    if state:
        # Track validation errors separately from other errors
        if isinstance(exception, ValidationError):
            count = state.get('validation_failures', 0) + 1
            state.set('validation_failures', count)
            # Save partial results for recovery
            if hasattr(exception, 'partial_data'):
                state.set('partial_data', exception.partial_data)

Source code in src/async_batch_llm/llm_strategies.py
async def on_error(
    self, exception: Exception, attempt: int, state: "RetryState | None" = None
) -> None:
    """
    Handle errors that occur during execute().

    Called by the framework when execute() raises an exception, before
    deciding whether to retry. This allows strategies to:
    - Inspect the error type to adjust retry behavior
    - Store error information for use in next attempt
    - Modify prompts based on validation errors
    - Track error patterns across attempts

    Args:
        exception: The exception that was raised during execute()
        attempt: Which attempt number failed (1, 2, 3, ...)
        state: Optional retry state that persists across attempts (v0.3.0)

    Default: no-op

    Example (v0.2.0):
        async def on_error(self, exception: Exception, attempt: int) -> None:
            # Store last error for smart retry logic
            self.last_error = exception

            # Track validation errors vs network errors
            if isinstance(exception, ValidationError):
                self.should_escalate_model = True

    Example (v0.3.0 with retry state):
        async def on_error(
            self, exception: Exception, attempt: int, state: RetryState | None = None
        ) -> None:
            if state:
                # Track validation errors separately from other errors
                if isinstance(exception, ValidationError):
                    count = state.get('validation_failures', 0) + 1
                    state.set('validation_failures', count)
                    # Save partial results for recovery
                    if hasattr(exception, 'partial_data'):
                        state.set('partial_data', exception.partial_data)
    """
    pass

prepare async

prepare() -> None

Initialize resources before making any LLM calls.

Called once per work item before any retry attempts. Use this to set up caches, initialize clients, etc.

Default: no-op

Source code in src/async_batch_llm/llm_strategies.py
async def prepare(self) -> None:
    """
    Initialize resources before making any LLM calls.

    Called once per work item before any retry attempts.
    Use this to set up caches, initialize clients, etc.

    Default: no-op
    """
    pass

PydanticAIStrategy

async_batch_llm.PydanticAIStrategy

PydanticAIStrategy(agent: Agent[None, TOutput])

Bases: LLMCallStrategy[TOutput]

Strategy for using PydanticAI agents.

This strategy wraps a PydanticAI agent, providing a clean interface for batch processing. The agent handles all model interaction, validation, and parsing.

Best for: structured output validated by Pydantic models, when you want PydanticAI to handle parsing and validation for you.

Initialize PydanticAI strategy.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `agent` | `Agent[None, TOutput]` | Configured PydanticAI agent | *required* |
Source code in src/async_batch_llm/llm_strategies.py
def __init__(self, agent: "Agent[None, TOutput]"):
    """
    Initialize PydanticAI strategy.

    Args:
        agent: Configured PydanticAI agent
    """
    if Agent is Any:
        raise ImportError(
            "pydantic-ai is required for PydanticAIStrategy. "
            "Install with: pip install 'async-batch-llm[pydantic-ai]'"
        )

    self.agent = agent

dry_run async

dry_run(prompt: str) -> tuple[TOutput, TokenUsage]

Return mock output based on agent's result_type for dry-run mode.

Source code in src/async_batch_llm/llm_strategies.py
async def dry_run(self, prompt: str) -> tuple[TOutput, TokenUsage]:
    """Return mock output based on agent's result_type for dry-run mode."""
    # Try to create a mock instance of the expected output type
    try:
        from pydantic import BaseModel

        result_type = self.agent.result_type  # ty:ignore[unresolved-attribute]

        # If result_type is a Pydantic model, try to create an instance
        if isinstance(result_type, type) and issubclass(result_type, BaseModel):
            # Use model_construct to create instance without validation
            # This allows creating instances even with required fields
            mock_output: TOutput = result_type.model_construct()
        else:
            # For non-Pydantic types, use base class default
            return await super().dry_run(prompt)

    except Exception:
        # If anything fails, fall back to base class default
        return await super().dry_run(prompt)

    # Return mock output with realistic token usage
    mock_tokens: TokenUsage = {
        "input_tokens": len(prompt.split()),  # Rough estimate
        "output_tokens": 50,
        "total_tokens": len(prompt.split()) + 50,
    }

    return mock_output, mock_tokens

execute async

execute(prompt: str, attempt: int, timeout: float, state: RetryState | None = None) -> tuple[TOutput, TokenUsage]

Execute PydanticAI agent call.

Note: timeout parameter is provided for information but timeout enforcement is handled by the framework wrapping this call in asyncio.wait_for().
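The note above can be sketched with plain asyncio: the framework cancels the strategy's coroutine from outside via asyncio.wait_for, independent of the `timeout` value passed into execute(). The code below is a self-contained illustration of that wrapping pattern, not framework code.

```python
import asyncio


async def slow_call() -> str:
    # Simulates an LLM call that takes longer than the allowed timeout.
    await asyncio.sleep(1.0)
    return "done"


async def main() -> str:
    try:
        # The framework wraps each execute() call like this; the timeout
        # argument the strategy receives is informational only.
        return await asyncio.wait_for(slow_call(), timeout=0.05)
    except asyncio.TimeoutError:
        return "timed out"


result = asyncio.run(main())
```

This is why a strategy does not need its own timeout handling: cancellation reaches it through the awaited call.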

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | `str` | The prompt to send to the LLM | *required* |
| `attempt` | `int` | Which retry attempt this is (1, 2, 3, ...) | *required* |
| `timeout` | `float` | Maximum time to wait for response (seconds) | *required* |
| `state` | `RetryState \| None` | Optional retry state (v0.3.0, unused by this strategy) | `None` |
Source code in src/async_batch_llm/llm_strategies.py
async def execute(
    self, prompt: str, attempt: int, timeout: float, state: RetryState | None = None
) -> tuple[TOutput, TokenUsage]:
    """Execute PydanticAI agent call.

    Note: timeout parameter is provided for information but timeout enforcement
    is handled by the framework wrapping this call in asyncio.wait_for().

    Args:
        prompt: The prompt to send to the LLM
        attempt: Which retry attempt this is (1, 2, 3, ...)
        timeout: Maximum time to wait for response (seconds)
        state: Optional retry state (v0.3.0, unused by this strategy)
    """
    result = await self.agent.run(prompt)

    # Extract token usage FIRST (before accessing result.output which may fail validation)
    usage = result.usage()
    tokens: TokenUsage = {
        "input_tokens": usage.request_tokens if usage else 0,
        "output_tokens": usage.response_tokens if usage else 0,
        "total_tokens": usage.total_tokens if usage else 0,
    }

    # Access result.output (may raise validation errors)
    try:
        output = result.output
    except Exception as e:
        # Attach token usage to exception so framework can track it
        if not hasattr(e, "__dict__"):
            # For built-in exceptions without __dict__, wrap in TokenTrackingError
            wrapped = TokenTrackingError(str(e), token_usage=tokens)
            wrapped.__cause__ = e
            raise wrapped from e
        else:
            e.__dict__["_failed_token_usage"] = tokens
            raise

    return output, tokens

GeminiStrategy

async_batch_llm.GeminiStrategy

GeminiStrategy(model: LLMModel, response_parser: Callable[[LLMResponse], TOutput], *, temperature: float = 0.0)

Bases: LLMCallStrategy[TOutput]

Strategy for calling an LLM model and parsing the response.

Accepts an LLMModel (e.g., GeminiModel or GeminiCachedModel) and a response parser. The model handles the API call and token extraction; the strategy handles response parsing and lifecycle delegation.

For caching, use GeminiStrategy(model=GeminiCachedModel(...)).

v0.6.0: Accepts LLMModel instead of raw client + model string.

Example

model = GeminiModel("gemini-2.5-flash", client)
strategy = GeminiStrategy(model, response_parser=lambda r: r.text)

With caching:

cached_model = GeminiCachedModel("gemini-2.5-flash", client, cached_content=[...])
strategy = GeminiStrategy(cached_model, response_parser=lambda r: r.text)

Initialize strategy.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `LLMModel` | An LLMModel instance (e.g., GeminiModel, GeminiCachedModel). | *required* |
| `response_parser` | `Callable[[LLMResponse], TOutput]` | Function to parse LLMResponse into TOutput. | *required* |
| `temperature` | `float` | Default sampling temperature (overridable by subclasses). | `0.0` |
Source code in src/async_batch_llm/llm_strategies.py
def __init__(
    self,
    model: "LLMModel",
    response_parser: Callable[[LLMResponse], TOutput],
    *,
    temperature: float = 0.0,
):
    """
    Initialize strategy.

    Args:
        model: An LLMModel instance (e.g., GeminiModel, GeminiCachedModel).
        response_parser: Function to parse LLMResponse into TOutput.
        temperature: Default sampling temperature (overridable by subclasses).
    """
    self.model = model
    self.response_parser = response_parser
    self.temperature = temperature

cleanup async

cleanup() -> None

Delegate to model.cleanup() if model has lifecycle.

Source code in src/async_batch_llm/llm_strategies.py
async def cleanup(self) -> None:
    """Delegate to model.cleanup() if model has lifecycle."""
    if isinstance(self.model, ManagedLLMModel):
        await self.model.cleanup()

execute async

execute(prompt: str, attempt: int, timeout: float, state: RetryState | None = None) -> tuple[TOutput, TokenUsage]

Execute LLM call via the model and parse the response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | `str` | The prompt to send to the LLM. | *required* |
| `attempt` | `int` | Which retry attempt this is (1, 2, 3, ...). | *required* |
| `timeout` | `float` | Maximum time for response (enforced by framework). | *required* |
| `state` | `RetryState \| None` | Optional retry state for cross-attempt persistence. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `tuple[TOutput, TokenUsage]` | Tuple of (parsed_output, token_usage). |

Source code in src/async_batch_llm/llm_strategies.py
async def execute(
    self, prompt: str, attempt: int, timeout: float, state: RetryState | None = None
) -> tuple[TOutput, TokenUsage]:
    """Execute LLM call via the model and parse the response.

    Args:
        prompt: The prompt to send to the LLM.
        attempt: Which retry attempt this is (1, 2, 3, ...).
        timeout: Maximum time for response (enforced by framework).
        state: Optional retry state for cross-attempt persistence.

    Returns:
        Tuple of (parsed_output, token_usage).
    """
    llm_response = await self.model.generate(prompt, temperature=self.temperature)

    try:
        output = self.response_parser(llm_response)
    except Exception as e:
        # Attach token usage to exception so framework can track it
        tokens = llm_response.token_usage
        if not hasattr(e, "__dict__"):
            wrapped = TokenTrackingError(str(e), token_usage=tokens)
            wrapped.__cause__ = e
            raise wrapped from e
        else:
            e.__dict__["_failed_token_usage"] = tokens
            raise

    return output, llm_response.token_usage

prepare async

prepare() -> None

Delegate to model.prepare() if model has lifecycle.

Source code in src/async_batch_llm/llm_strategies.py
async def prepare(self) -> None:
    """Delegate to model.prepare() if model has lifecycle."""
    if isinstance(self.model, ManagedLLMModel):
        await self.model.prepare()

Models

GeminiModel

async_batch_llm.GeminiModel

GeminiModel(model: str, client: Client, *, safety_settings: list[dict[str, Any]] | None = None, system_instruction: str | None = None)

LLM model backed by the Google Gemini API.

Wraps a genai.Client and model name, handling API calls, token extraction, and response normalization. Implements the LLMModel protocol.

Example

client = genai.Client(api_key="...")
model = GeminiModel("gemini-2.5-flash", client)
response = await model.generate("Hello!")
print(response.text, response.input_tokens)

Added in v0.6.0.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str` | Model name (e.g., "gemini-3.1-flash-lite-preview"). | *required* |
| `client` | `Client` | Initialized genai.Client. | *required* |
| `safety_settings` | `list[dict[str, Any]] \| None` | Default safety settings for all calls. | `None` |
| `system_instruction` | `str \| None` | Default system instruction (overridable per-call). | `None` |
Source code in src/async_batch_llm/models.py
def __init__(
    self,
    model: str,
    client: "genai.Client",
    *,
    safety_settings: list[dict[str, Any]] | None = None,
    system_instruction: str | None = None,
):
    """
    Args:
        model: Model name (e.g., "gemini-3.1-flash-lite-preview").
        client: Initialized genai.Client.
        safety_settings: Default safety settings for all calls.
        system_instruction: Default system instruction (overridable per-call).
    """
    if genai is None:
        raise ImportError(
            "google-genai is required for GeminiModel. "
            "Install with: pip install 'async-batch-llm[gemini]'"
        )

    self._model = model
    self._client = client
    self._safety_settings = safety_settings
    self._default_system_instruction = system_instruction

generate async

generate(prompt: str | list[Any], *, temperature: float = 0.0, system_instruction: str | None = None, config: dict[str, Any] | None = None) -> LLMResponse

Generate a response from Gemini.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | `str \| list[Any]` | Text prompt or list of content parts (multimodal). | *required* |
| `temperature` | `float` | Sampling temperature. | `0.0` |
| `system_instruction` | `str \| None` | Override default system instruction. | `None` |
| `config` | `dict[str, Any] \| None` | Additional provider-specific config entries. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `LLMResponse` | Normalized LLMResponse. |

Source code in src/async_batch_llm/models.py
async def generate(
    self,
    prompt: str | list[Any],
    *,
    temperature: float = 0.0,
    system_instruction: str | None = None,
    config: dict[str, Any] | None = None,
) -> LLMResponse:
    """Generate a response from Gemini.

    Args:
        prompt: Text prompt or list of content parts (multimodal).
        temperature: Sampling temperature.
        system_instruction: Override default system instruction.
        config: Additional provider-specific config entries.

    Returns:
        Normalized LLMResponse.
    """
    # Build config dict
    call_config: dict[str, Any] = {"temperature": temperature}

    si = system_instruction or self._default_system_instruction
    if si is not None:
        call_config["system_instruction"] = si

    if self._safety_settings:
        call_config["safety_settings"] = self._safety_settings

    if config:
        call_config.update(config)

    # Make the API call (config is built as a dict; the SDK accepts this at runtime
    # even though the type stubs say GenerateContentConfig)
    response = await self._client.aio.models.generate_content(
        model=self._model,
        contents=prompt,
        config=call_config,  # type: ignore[arg-type]  # ty:ignore[invalid-argument-type]
    )

    # Extract tokens
    input_tokens, output_tokens, total_tokens, cached_tokens = _extract_tokens(response)

    # Extract text (may be None if safety-blocked)
    text = response.text
    if text is None:
        metadata = _extract_metadata(response)
        safety_info = ""
        if metadata and "safety_ratings" in metadata:
            safety_info = f" Safety ratings: {metadata['safety_ratings']}"
        raise ValueError(
            f"Empty response from model (likely blocked by safety filter).{safety_info}"
        )

    return LLMResponse(
        text=text,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        total_tokens=total_tokens,
        cached_input_tokens=cached_tokens,
        metadata=_extract_metadata(response),
        raw=response,
    )

GeminiCachedModel

async_batch_llm.GeminiCachedModel

GeminiCachedModel(model: str, client: Client, cached_content: list[Content], *, cache_ttl_seconds: int = 3600, cache_renewal_buffer_seconds: int = 300, auto_renew: bool = True, cache_tags: dict[str, str] | None = None, safety_settings: list[dict[str, Any]] | None = None)

LLM model backed by Google Gemini with context caching.

Wraps a genai.Client with cache lifecycle management. Implements the ManagedLLMModel protocol: call prepare() before first use, cleanup() when done.

IMPORTANT — share one instance across work items. Create ONE GeminiCachedModel and reuse it across every LLMWorkItem that should share the cached context. Constructing a new instance per item defeats caching entirely and can cost 10× more. The framework calls prepare() exactly once per unique instance, so sharing is the intended lifecycle. See examples/example_gemini_cached.py for the pattern.

This provides 70-90% cost savings when shared correctly.
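The sharing requirement can be illustrated without the real API. In the stand-in below (written for this illustration only, not framework code), `prepare()` plays the role of the costly cache creation; because it is idempotent and the single instance is reused, the cache is built once no matter how many work items run against it.

```python
import asyncio


class SharedCacheModel:
    """Stand-in showing why one instance must be shared: prepare() builds the
    (expensive) cache once, and every work item then reuses it."""

    def __init__(self):
        self.cache_builds = 0
        self._prepared = False

    async def prepare(self):
        if self._prepared:  # idempotent, like GeminiCachedModel.prepare()
            return
        self.cache_builds += 1  # pretend this is the costly cache creation
        self._prepared = True

    async def generate(self, prompt):
        return f"cached-answer: {prompt}"


async def main():
    model = SharedCacheModel()  # ONE instance...
    await model.prepare()
    prompts = ["item-1", "item-2", "item-3"]
    results = [await model.generate(p) for p in prompts]  # ...reused by every item
    return model.cache_builds, results


builds, results = asyncio.run(main())
```

Constructing a fresh instance per item would bump `cache_builds` once per item, which is exactly the per-item cache creation (and cost) the warning above is about.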

Example

model = GeminiCachedModel(
    "gemini-2.5-flash",
    client,
    cached_content=[system_instruction, context_docs],
)
await model.prepare()  # finds or creates cache
response = await model.generate("Process this")
await model.cleanup()  # preserves cache for reuse

Added in v0.6.0.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str` | Model name (e.g., "gemini-2.5-flash"). | *required* |
| `client` | `Client` | Initialized genai.Client. | *required* |
| `cached_content` | `list[Content]` | Content to cache (system instructions, documents). | *required* |
| `cache_ttl_seconds` | `int` | Cache TTL in seconds. | `3600` (1 hour) |
| `cache_renewal_buffer_seconds` | `int` | Renew this many seconds before expiry. | `300` (5 minutes) |
| `auto_renew` | `bool` | Auto-renew expired caches in generate(). | `True` |
| `cache_tags` | `dict[str, str] \| None` | Tags for precise cache matching. | `None` |
| `safety_settings` | `list[dict[str, Any]] \| None` | Default safety settings for all calls. | `None` |
Source code in src/async_batch_llm/models.py
def __init__(
    self,
    model: str,
    client: "genai.Client",
    cached_content: list["Content"],
    *,
    cache_ttl_seconds: int = 3600,
    cache_renewal_buffer_seconds: int = 300,
    auto_renew: bool = True,
    cache_tags: dict[str, str] | None = None,
    safety_settings: list[dict[str, Any]] | None = None,
):
    """
    Args:
        model: Model name (e.g., "gemini-2.5-flash").
        client: Initialized genai.Client.
        cached_content: Content to cache (system instructions, documents).
        cache_ttl_seconds: Cache TTL in seconds (default: 3600 = 1 hour).
        cache_renewal_buffer_seconds: Renew this many seconds before expiry
            (default: 300 = 5 minutes).
        auto_renew: Auto-renew expired caches in generate() (default: True).
        cache_tags: Tags for precise cache matching.
        safety_settings: Default safety settings for all calls.
    """
    if genai is None:
        raise ImportError(
            "google-genai is required for GeminiCachedModel. "
            "Install with: pip install 'async-batch-llm[gemini]'"
        )

    if cache_renewal_buffer_seconds >= cache_ttl_seconds:
        raise ValueError(
            f"cache_renewal_buffer_seconds ({cache_renewal_buffer_seconds}) "
            f"must be less than cache_ttl_seconds ({cache_ttl_seconds})."
        )

    if 10 <= cache_ttl_seconds < 60:
        import warnings

        warnings.warn(
            f"cache_ttl_seconds ({cache_ttl_seconds}) is less than 60 seconds. "
            f"Very short TTLs defeat the purpose of caching. "
            f"Recommended minimum: 300 seconds (5 minutes).",
            UserWarning,
            stacklevel=2,
        )

    if cache_renewal_buffer_seconds < 60:
        import warnings

        warnings.warn(
            f"cache_renewal_buffer_seconds ({cache_renewal_buffer_seconds}) is less than "
            f"60 seconds. Small buffers risk renewing on every call if generation takes "
            f"longer than the buffer. Recommended minimum: 60 seconds.",
            UserWarning,
            stacklevel=2,
        )

    self._model = model
    self._client = client
    self._cached_content = cached_content
    self._cache_ttl_seconds = cache_ttl_seconds
    self._cache_renewal_buffer_seconds = cache_renewal_buffer_seconds
    self._auto_renew = auto_renew
    self._cache_tags = cache_tags or {}
    self._safety_settings = safety_settings

    self._cache: Any = None
    self._cache_created_at: float | None = None
    self._cache_lock: Any = None
    self._prepared = False

cache_name property

cache_name: str | None

The name of the active cache, or None.

cleanup async

cleanup() -> None

Preserve cache for reuse (does not delete). Idempotent.

Source code in src/async_batch_llm/models.py
async def cleanup(self) -> None:
    """Preserve cache for reuse (does not delete). Idempotent."""
    if self._cache:
        logger.info(
            f"Leaving cache active for reuse: {self._cache.name} "
            f"(TTL: {self._cache_ttl_seconds}s, will expire naturally)"
        )

delete_cache async

delete_cache() -> None

Explicitly delete the cache.

Safe to call concurrently: the cache lock serializes delete attempts so the provider API fires at most once, and late callers that arrive after the cache is cleared return silently.

Source code in src/async_batch_llm/models.py
async def delete_cache(self) -> None:
    """Explicitly delete the cache.

    Safe to call concurrently: the cache lock serializes delete attempts
    so the provider API fires at most once, and late callers that arrive
    after the cache is cleared return silently.
    """
    if self._cache is None:
        return

    import asyncio as _asyncio

    if getattr(self, "_cache_lock", None) is None:
        self._cache_lock = _asyncio.Lock()

    async with self._cache_lock:
        cache = self._cache
        if cache is None:
            # A concurrent caller already finished the delete.
            return

        # Capture the name up front so log messages don't depend on
        # self._cache still existing after concurrent callers clear it.
        cache_name = cache.name
        # Clear state BEFORE the API call so concurrent tasks that
        # re-enter see an empty cache and no-op.
        self._cache = None
        self._cache_created_at = None
        self._prepared = False

        try:
            await self._client.aio.caches.delete(name=cache_name)
            logger.info(f"Deleted Gemini cache: {cache_name}")
        except Exception as e:
            # Keep Exception (not BaseException) so KeyboardInterrupt still propagates;
            # cache-delete failures are best-effort — caches expire on their own.
            logger.warning(
                f"Failed to delete Gemini cache '{cache_name}': {e}. "
                "Cache may have already expired or been deleted.",
                exc_info=True,
            )

generate async

generate(prompt: str | list[Any], *, temperature: float = 0.0, system_instruction: str | None = None, config: dict[str, Any] | None = None) -> LLMResponse

Generate a response using the cached context.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | `str \| list[Any]` | Text prompt or multimodal content parts. | *required* |
| `temperature` | `float` | Sampling temperature. | `0.0` |
| `system_instruction` | `str \| None` | Not supported with caching — raises ValueError. | `None` |
| `config` | `dict[str, Any] \| None` | Additional provider-specific config entries. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `LLMResponse` | Normalized LLMResponse. |

Source code in src/async_batch_llm/models.py
async def generate(
    self,
    prompt: str | list[Any],
    *,
    temperature: float = 0.0,
    system_instruction: str | None = None,
    config: dict[str, Any] | None = None,
) -> LLMResponse:
    """Generate a response using the cached context.

    Args:
        prompt: Text prompt or multimodal content parts.
        temperature: Sampling temperature.
        system_instruction: Not supported with caching — raises ValueError.
        config: Additional provider-specific config entries.

    Returns:
        Normalized LLMResponse.
    """
    if system_instruction is not None:
        raise ValueError(
            "system_instruction cannot be overridden per-call with cached models. "
            "The system instruction is baked into the cache at creation time."
        )

    # Auto-renew if expired
    if self._auto_renew and self._is_cache_expired():
        logger.info(
            "Cache expired or about to expire, renewing before API call "
            f"(age: {time.time() - (self._cache_created_at or 0):.0f}s, "
            f"renewal buffer: {self._cache_renewal_buffer_seconds}s)"
        )

        import asyncio

        if self._cache_lock is None:
            self._cache_lock = asyncio.Lock()

        async with self._cache_lock:
            if self._is_cache_expired():
                self._cache = None
                self._cache_created_at = None
                self._prepared = False
                await self._find_or_create_cache()
                self._prepared = True

    if self._cache is None:
        raise RuntimeError("Cache not initialized — call prepare() first")

    # Build config with cache reference
    call_config: dict[str, Any] = {
        "cached_content": self._cache.name,
        "temperature": temperature,
    }

    if self._safety_settings:
        call_config["safety_settings"] = self._safety_settings

    if config:
        call_config.update(config)

    response = await self._client.aio.models.generate_content(
        model=self._model,
        contents=prompt,
        config=call_config,  # type: ignore[arg-type]  # ty:ignore[invalid-argument-type]
    )

    input_tokens, output_tokens, total_tokens, cached_tokens = _extract_tokens(response)

    text = response.text
    if text is None:
        metadata = _extract_metadata(response)
        safety_info = ""
        if metadata and "safety_ratings" in metadata:
            safety_info = f" Safety ratings: {metadata['safety_ratings']}"
        raise ValueError(
            f"Empty response from model (likely blocked by safety filter).{safety_info}"
        )

    return LLMResponse(
        text=text,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        total_tokens=total_tokens,
        cached_input_tokens=cached_tokens,
        metadata=_extract_metadata(response),
        raw=response,
    )

prepare async

prepare() -> None

Find or create the Gemini cache. Idempotent.

Source code in src/async_batch_llm/models.py
async def prepare(self) -> None:
    """Find or create the Gemini cache. Idempotent."""
    if self._prepared:
        return

    import asyncio

    if self._cache_lock is None:
        self._cache_lock = asyncio.Lock()

    async with self._cache_lock:
        if self._prepared:
            return
        await self._find_or_create_cache()
        self._prepared = True
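Both `prepare()` and the renewal path in `generate()` rely on the same double-checked locking pattern: check the flag, acquire the lock, then check again before doing the expensive work, so concurrent callers trigger cache creation exactly once. A minimal standalone sketch of that pattern (the `CacheHolder` class and its counter are illustrative, not part of the library):

```python
import asyncio


class CacheHolder:
    """Illustrative double-checked locking, mirroring prepare() above."""

    def __init__(self) -> None:
        self._prepared = False
        self._lock: asyncio.Lock | None = None
        self.create_calls = 0  # counts how often the expensive setup runs

    async def prepare(self) -> None:
        if self._prepared:          # fast path: no lock once initialized
            return
        if self._lock is None:      # lazy lock creation, as in the source
            self._lock = asyncio.Lock()
        async with self._lock:
            if self._prepared:      # re-check: another task may have won
                return
            self.create_calls += 1  # stands in for _find_or_create_cache()
            await asyncio.sleep(0)  # simulate awaited setup work
            self._prepared = True


async def main() -> int:
    holder = CacheHolder()
    # Ten concurrent callers; the setup body runs exactly once.
    await asyncio.gather(*(holder.prepare() for _ in range(10)))
    return holder.create_calls


print(asyncio.run(main()))  # → 1
```

Without the second check inside the lock, every caller queued on the lock would repeat the setup after the first one finished.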

Protocols

LLMModel

async_batch_llm.LLMModel

Bases: Protocol

Protocol for LLM model instances that can generate responses.

Implementations wrap a specific provider's client and model configuration, handling API calls and response normalization. Strategies call generate() without needing to know about provider-specific details.

Added in v0.6.0.

generate async

generate(prompt: str | list[Any], *, temperature: float = 0.0, system_instruction: str | None = None, config: dict[str, Any] | None = None) -> LLMResponse

Generate a response from the LLM.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | `str \| list[Any]` | Text prompt, or list of content parts for multimodal input. | *required* |
| `temperature` | `float` | Sampling temperature (0.0 = deterministic). | `0.0` |
| `system_instruction` | `str \| None` | System instruction override (None = use default). | `None` |
| `config` | `dict[str, Any] \| None` | Provider-specific configuration (e.g., response_mime_type). | `None` |

Returns:

| Type | Description |
| --- | --- |
| `LLMResponse` | Normalized LLMResponse with text, token counts, and metadata. |

Source code in src/async_batch_llm/core/protocols.py
async def generate(
    self,
    prompt: str | list[Any],
    *,
    temperature: float = 0.0,
    system_instruction: str | None = None,
    config: dict[str, Any] | None = None,
) -> LLMResponse:
    """
    Generate a response from the LLM.

    Args:
        prompt: Text prompt, or list of content parts for multimodal input.
        temperature: Sampling temperature (0.0 = deterministic).
        system_instruction: System instruction override (None = use default).
        config: Provider-specific configuration (e.g., response_mime_type).

    Returns:
        Normalized LLMResponse with text, token counts, and metadata.
    """
    ...

ManagedLLMModel

async_batch_llm.ManagedLLMModel

Bases: LLMModel, Protocol

LLMModel with lifecycle management (e.g., caching).

Models that need one-time setup (creating a cache) or cleanup implement this protocol. The strategy delegates prepare/cleanup calls to the model.

Added in v0.6.0.

cleanup async

cleanup() -> None

Release resources. Must be idempotent.

Source code in src/async_batch_llm/core/protocols.py
async def cleanup(self) -> None:
    """Release resources. Must be idempotent."""
    ...

prepare async

prepare() -> None

Initialize resources (e.g., find or create a cache). Must be idempotent.

Source code in src/async_batch_llm/core/protocols.py
async def prepare(self) -> None:
    """Initialize resources (e.g., find or create a cache). Must be idempotent."""
    ...
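A strategy that accepts any `LLMModel` can detect the optional lifecycle hooks at runtime before delegating to them, which is what "the strategy delegates prepare/cleanup calls to the model" amounts to. A hedged sketch (`run_with_lifecycle` and the stub classes are illustrative, not library API):

```python
import asyncio


class PlainModel:
    """Implements only generate(); no lifecycle hooks."""

    async def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class CachedModel(PlainModel):
    """Adds idempotent prepare()/cleanup(), as ManagedLLMModel requires."""

    def __init__(self) -> None:
        self.prepared = False

    async def prepare(self) -> None:
        self.prepared = True  # idempotent: safe to call repeatedly

    async def cleanup(self) -> None:
        self.prepared = False  # idempotent: safe to call repeatedly


async def run_with_lifecycle(model, prompt: str) -> str:
    """Call prepare/cleanup only if the model provides them."""
    if hasattr(model, "prepare"):
        await model.prepare()
    try:
        return await model.generate(prompt)
    finally:
        if hasattr(model, "cleanup"):
            await model.cleanup()


print(asyncio.run(run_with_lifecycle(CachedModel(), "hi")))  # → echo: hi
```

The `try`/`finally` mirrors the framework guarantee that cleanup runs once whether the call succeeds or fails.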

LLMResponse

async_batch_llm.LLMResponse dataclass

LLMResponse(text: str, input_tokens: int, output_tokens: int, total_tokens: int, cached_input_tokens: int = 0, metadata: dict[str, Any] | None = None, raw: Any = None)

Normalized response from any LLM provider.

Returned by LLMModel.generate(). Provides a provider-agnostic interface so strategies don't need to know about provider-specific response formats (Gemini, OpenAI, etc.).

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `text` | `str` | The response text content. |
| `input_tokens` | `int` | Number of input/prompt tokens. |
| `output_tokens` | `int` | Number of output/completion tokens. |
| `total_tokens` | `int` | Total tokens used. |
| `cached_input_tokens` | `int` | Input tokens served from cache (0 if no caching). |
| `metadata` | `dict[str, Any] \| None` | Provider-specific metadata (safety ratings, finish reason, etc.). |
| `raw` | `Any` | The raw provider response object, for edge cases. |
Added in v0.6.0.

token_usage property

token_usage: TokenUsage

Return token counts as a TokenUsage dict.