Architecture¶
The Dataset Generator is built as a configurable directed graph of processing nodes. The high-level pipeline and core components are illustrated below.
High-Level Pipeline¶
flowchart TD
classDef expensiveModel stroke:#d55,stroke-width:2px
classDef cheapModel stroke:#5d5,stroke-width:2px
A0[Start] --> TopicGen
TopicGen[Topic Generation] --> Email
Email[Email Text] --> Answer
Answer[Answer Text] --> Tags
Tags[Add Tags] --> Rephrase
Rephrase --> Translate
Translate --> End
class TopicGen,Email,Answer expensiveModel
class Tags,Rephrase,Translate cheapModel
TopicGen, Email, and Answer nodes use “expensive” models (e.g. GPT-4) to ensure high quality.
Tags, Rephrase, and Translate use cheaper models (e.g. Qwen-7B) for cost efficiency.
Each node reads and enriches a shared context object, caching its output so downstream nodes can reuse results without recomputing.
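The flow described above can be sketched as a sequence of stages that enrich a shared context dict. The stage names and lambda bodies here are hypothetical stand-ins for the real nodes, not the project's actual API:

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]

# Hypothetical stand-ins for the pipeline stages; each stage reads the
# shared context and writes its result back under its own key.
stages: List[Tuple[str, Callable[[Context], Any]]] = [
    ("topic", lambda ctx: "billing issue"),
    ("email", lambda ctx: f"Customer email about {ctx['topic']}"),
    ("answer", lambda ctx: f"Support reply to: {ctx['email']}"),
]

def run_pipeline(stages: List[Tuple[str, Callable[[Context], Any]]],
                 context: Context) -> Context:
    for name, stage in stages:
        context[name] = stage(context)  # enrich the shared context in order
    return context

result = run_pipeline(stages, {})
```

Because every stage only reads keys written by earlier stages, the execution order shown in the flowchart is the only constraint the runner must respect.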
Graph Runner and Nodes¶
GraphRunner: The entry point that triggers execution of the graph from all terminal nodes. Calling run() executes each node in order, respecting dependencies.
Graph: Maintains the parent→child relationships of all Node objects in the pipeline.
Node: Orchestrates caching. If a node’s result is already computed (in its Cache), it loads it; otherwise it creates a private _ExecutionNode to compute the output. Nodes pass data via a shared context.
AINode: A subclass of Node that wraps an AI assistant. It generates prompts and calls the assistant to produce text outputs.
Storage: A global key-value store for persisting intermediate data across the pipeline.
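The caching behaviour of Node can be sketched as follows. Cache, Node, and the compute callback shown here are simplified illustrations, not the project's actual classes (in the real code the computation is delegated to a private _ExecutionNode):

```python
class Cache:
    """Minimal in-memory stand-in for the pipeline's Cache."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, value):
        self._store[key] = value


class Node:
    """Load the result from the cache if present; otherwise compute
    and store it so later runs can skip the work."""
    def __init__(self, name, compute, cache):
        self.name = name
        self._compute = compute
        self._cache = cache
        self.children = []

    def run(self, context):
        result = self._cache.get(self.name)
        if result is None:
            result = self._compute(context)   # stands in for _ExecutionNode
            self._cache.put(self.name, result)
        context[self.name] = result           # pass data via shared context
        for child in self.children:
            child.run(context)
        return context


cache = Cache()
calls = {"count": 0}

def expensive_compute(ctx):
    calls["count"] += 1
    return "generated email text"

email_node = Node("email", expensive_compute, cache)
email_node.run({})
ctx = email_node.run({})   # second run reuses the cached result
```

The second run() never invokes the compute callback: the cached value is loaded and written into the context, which is how downstream nodes reuse results without recomputing.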
Assistants & Data Models¶
AIAssistant: Abstracts any LLM API (e.g. OpenAI or a local model) behind a uniform interface. Developers can plug in different model endpoints here.
PromptBuilder: Constructs prompts in a structured (JSON-schema) format. It takes an InputType definition, adds current context fields, and specifies the expected OutputType. This ensures inputs and outputs are validated.
InputType / OutputType: Pydantic models (serialized to JSON Schema) that declare the fields for prompts and responses. They include validation rules and documentation for each field.
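A pair of such models, and the kind of prompt a PromptBuilder might assemble from them, could look like this. EmailInput, EmailOutput, and the prompt layout are hypothetical examples, not the project's real types (the sketch assumes Pydantic v2):

```python
from pydantic import BaseModel, Field

class EmailInput(BaseModel):
    """Hypothetical InputType: fields fed into the prompt."""
    topic: str = Field(description="Support topic the email is about")
    language: str = Field(default="en", description="ISO language code")

class EmailOutput(BaseModel):
    """Hypothetical OutputType: the JSON shape the model must return."""
    subject: str = Field(description="Generated subject line")
    body: str = Field(min_length=1, description="Generated email body")

# A PromptBuilder-style prompt embeds the input values and the expected
# output schema, so the model's JSON reply can be validated afterwards
# with EmailOutput.model_validate_json(...).
prompt = (
    "Write a support email.\n"
    f"Input: {EmailInput(topic='refund').model_dump_json()}\n"
    f"Reply as JSON matching this schema:\n{EmailOutput.model_json_schema()}"
)
```

Declaring both sides as schemas means malformed model replies fail validation immediately instead of propagating bad data to downstream nodes.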
Randomized Inputs¶
To introduce variability without calling an AI model:
IRandom: An interface for classes that can generate random values on demand.
RandomCollection: Holds a list of values with associated weights. It optionally applies small random perturbations to the weights on each draw to simulate noise, then samples an element according to the perturbed weights.
RandomTable: Maps a key (often an enum, e.g. ticket type) to a RandomCollection of values (e.g. possible priorities for that ticket type). This allows conditional sampling (different priorities for bugs vs. feature requests, etc.).
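The weighted sampling with per-draw weight perturbation described above can be sketched as follows; the class bodies and parameter names are illustrative, not the project's actual implementation:

```python
import random

class RandomCollection:
    """Weighted sampling; each draw perturbs the weights slightly."""
    def __init__(self, values, weights, noise=0.1, rng=None):
        self.values = list(values)
        self.weights = list(weights)
        self.noise = noise                      # relative perturbation size
        self.rng = rng or random.Random()

    def draw(self):
        # Perturb each weight by up to ±noise (relative) to simulate
        # natural variation, then sample with the perturbed weights.
        perturbed = [
            w * (1 + self.rng.uniform(-self.noise, self.noise))
            for w in self.weights
        ]
        return self.rng.choices(self.values, weights=perturbed, k=1)[0]

class RandomTable:
    """Map a key (e.g. a ticket type) to its own RandomCollection."""
    def __init__(self, table):
        self.table = table

    def draw(self, key):
        return self.table[key].draw()

# Conditional sampling: bugs skew toward high priority, features toward low.
priorities = RandomTable({
    "bug": RandomCollection(["high", "medium", "low"], [5, 3, 1]),
    "feature": RandomCollection(["high", "medium", "low"], [1, 3, 5]),
})
sample = priorities.draw("bug")
```

Keeping the noise small preserves the intended distribution while avoiding the unnaturally exact frequencies a fixed-weight sampler would produce over many tickets.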
Cost & Usage Analysis¶
The system tracks token counts and cost throughout:
AssistantRun: Records the token usage and cost of a single run_openai (or equivalent) call for one node.
AssistantRunsAnalyzer: Aggregates multiple AssistantRun records (e.g. from one pipeline run) into an AnalysisResult summarizing total tokens and cost.
DatasetUsageAnalyzer: Gathers cost information across all generated tickets and produces a formatted summary (a per-assistant breakdown and overall totals in a configurable currency). It can emit reports as CSV files or formatted tables.
These analyzers help you audit the monetary cost of generation per assistant. See the Usage section for examples of generating cost reports.
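The aggregation step can be sketched as follows. The record fields, assistant names, and cost figures are invented for illustration; they are not the project's real AssistantRun schema or actual prices:

```python
from dataclasses import dataclass

@dataclass
class AssistantRun:
    """Hypothetical per-call usage record."""
    assistant: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float

def analyze(runs):
    """Aggregate runs into per-assistant totals plus an overall cost,
    roughly what an AssistantRunsAnalyzer-style summary contains."""
    summary = {}
    for r in runs:
        s = summary.setdefault(r.assistant, {"tokens": 0, "cost_usd": 0.0})
        s["tokens"] += r.prompt_tokens + r.completion_tokens
        s["cost_usd"] += r.cost_usd
    total = sum(s["cost_usd"] for s in summary.values())
    return summary, total

runs = [
    AssistantRun("gpt-4", 1200, 300, 0.051),
    AssistantRun("qwen-7b", 800, 200, 0.001),
    AssistantRun("gpt-4", 900, 250, 0.040),
]
summary, total = analyze(runs)
```

Grouping by assistant makes the expensive/cheap model split from the pipeline diagram directly visible in the cost report.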