Architecture
============

The Dataset Generator is built as a configurable directed graph of processing nodes. The high-level pipeline and core components are illustrated below.

High-Level Pipeline
-------------------

.. code-block:: text

   flowchart TD
     classDef expensiveModel stroke:#d55,stroke-width:2px
     classDef cheapModel     stroke:#5d5,stroke-width:2px

     A0[Start]                  --> TopicGen
     TopicGen[Topic Generation] --> Email
     Email[Email Text]          --> Answer
     Answer[Answer Text]        --> Tags
     Tags[Add Tags]             --> Rephrase
     Rephrase                   --> Translate
     Translate                  --> End

     class TopicGen,Email,Answer expensiveModel
     class Tags,Rephrase,Translate cheapModel
1. The TopicGen, Email, and Answer nodes use “expensive” models (e.g. GPT-4) to ensure high quality.

2. The Tags, Rephrase, and Translate nodes use cheaper models (e.g. Qwen-7B) for cost efficiency.

3. Each node reads and enriches a shared context object, caching its output so downstream nodes can reuse results without recomputing (a minimal sketch of this pattern follows).
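
A minimal Python sketch of this flow is shown below. The ``Context`` and ``PipelineNode`` classes, the node names, and the model strings are illustrative assumptions, not the generator's actual API:

.. code-block:: python

   from dataclasses import dataclass, field
   from typing import Any, Callable, Dict

   @dataclass
   class Context:
       """Shared context that every node reads, enriches, and caches into."""
       data: Dict[str, Any] = field(default_factory=dict)

   @dataclass
   class PipelineNode:
       name: str
       model: str                          # e.g. "gpt-4" (expensive) or "qwen-7b" (cheap)
       produce: Callable[["Context"], Any]

       def run(self, ctx: Context) -> Any:
           if self.name not in ctx.data:   # downstream calls reuse the cached result
               ctx.data[self.name] = self.produce(ctx)
           return ctx.data[self.name]

   # Expensive models early in the chain, cheap ones for the later steps.
   pipeline = [
       PipelineNode("topic", "gpt-4",   lambda ctx: "billing issue"),
       PipelineNode("email", "gpt-4",   lambda ctx: f"Email about {ctx.data['topic']}"),
       PipelineNode("tags",  "qwen-7b", lambda ctx: ["billing"]),
   ]

   ctx = Context()
   for node in pipeline:
       node.run(ctx)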

Graph Runner and Nodes
----------------------

.. code-block:: text

   @startuml
   package graph {
     class GraphRunner {
       + run(): List<Any>
     }
     class Graph {
       - endNodes: List<INode>
     }
     class Node {
       - parents: List<INode>
       - was_executed: bool
       + {async} execute(): Any
     }
     class _ExecutionNode <<private>> {
       + execute(inputs): Any
     }
     class Cache <<private>> {
       - stored_result: Any
     }
     GraphRunner --> Graph
     Graph --> Node
     Node --> _ExecutionNode
     Node --> Cache
   }
   package ai_graph {
     class AINode extends graph.Node {
       - chat_assistant: ChatAssistant
       + {async} execute(inputs)
     }
     class Storage {
       - data: Map<String, Object>
       + save(data)
       + load(): Object
     }
   }
   @enduml

- ``GraphRunner``: The entry point that triggers execution of the graph from all terminal nodes. Calling ``run()`` executes each node in dependency order.

- ``Graph``: Maintains the parent→child relationships of all ``Node`` objects in the pipeline.

- ``Node``: Orchestrates caching. If a node's result has already been computed (held in its ``Cache``), it is loaded; otherwise the node creates a private ``_ExecutionNode`` to compute the output. Nodes pass data via a shared context (see the sketch after this list).

- ``AINode``: A subclass of ``Node`` that wraps an AI assistant. It generates prompts and calls the assistant to produce text outputs.

- ``Storage``: A global key-value store for persisting intermediate data across the pipeline.
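
A minimal Python sketch of this caching contract, using the class and attribute names from the diagram; the method bodies are illustrative assumptions, not the actual implementation:

.. code-block:: python

   import asyncio
   from typing import Any, Callable, List

   class Cache:
       """Holds a node's previously computed result, if any."""
       _EMPTY = object()

       def __init__(self) -> None:
           self.stored_result: Any = self._EMPTY

       @property
       def is_filled(self) -> bool:
           return self.stored_result is not self._EMPTY

   class Node:
       def __init__(self, parents: List["Node"], compute: Callable[[List[Any]], Any]) -> None:
           self.parents = parents
           self.was_executed = False
           self._cache = Cache()
           self._compute = compute           # stand-in for the private _ExecutionNode

       async def execute(self) -> Any:
           if self._cache.is_filled:         # load the cached result instead of recomputing
               return self._cache.stored_result
           inputs = [await parent.execute() for parent in self.parents]
           self._cache.stored_result = self._compute(inputs)
           self.was_executed = True
           return self._cache.stored_result

   class GraphRunner:
       """Triggers execution from all terminal nodes of the graph."""
       def __init__(self, end_nodes: List[Node]) -> None:
           self._end_nodes = end_nodes

       async def run(self) -> List[Any]:
           return [await node.execute() for node in self._end_nodes]

   topic = Node([], lambda inputs: "billing issue")
   email = Node([topic], lambda inputs: f"Email about {inputs[0]}")
   print(asyncio.run(GraphRunner([email]).run()))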

Assistants & Data Models
------------------------

.. code-block:: text

   @startuml
   package ai_graph {
     class AIAssistant {
       + get_response(prompt): string
     }
     class PromptBuilder {
       + add_input()
       + set_output()
     }
     class InputType <<DataModel>> {
       + get_description(): string
     }
     class OutputType <<DataModel>> {
       + get_description(): string
     }
   }
   @enduml

- ``AIAssistant``: Abstracts any LLM API (e.g. OpenAI, a local LLM) behind a uniform interface. Developers can plug in different model endpoints here.

- ``PromptBuilder``: Constructs prompts in a structured (JSON-schema) format. It takes an ``InputType`` definition, adds the current context fields, and specifies the expected ``OutputType``, so that both inputs and outputs are validated.

- ``InputType`` / ``OutputType``: Pydantic models (JSON-schema data models) that declare the fields for prompts and responses, including validation rules and a description for each field (see the sketch below).
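
A short sketch of the idea, assuming Pydantic v2; the ``EmailInput``/``EmailOutput`` models and their fields are hypothetical examples, not the project's real data models:

.. code-block:: python

   from pydantic import BaseModel, Field

   class EmailInput(BaseModel):
       """InputType: fields injected into the prompt."""
       topic: str = Field(description="Ticket topic the email should cover")
       tone: str = Field(default="formal", description="Writing tone of the email")

   class EmailOutput(BaseModel):
       """OutputType: schema the assistant's JSON reply must satisfy."""
       subject: str = Field(description="Subject line of the generated email")
       body: str = Field(description="Full body of the generated email")

   # A prompt builder can embed both JSON schemas so the model knows exactly
   # which fields it receives and which fields it must return.
   prompt = (
       f"Input schema: {EmailInput.model_json_schema()}\n"
       f"Respond with JSON matching: {EmailOutput.model_json_schema()}"
   )

   # The raw LLM reply is validated against the declared output schema.
   reply = EmailOutput.model_validate_json('{"subject": "Refund", "body": "Hello..."}')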

Randomized Inputs
-----------------

.. code-block:: text

   @startuml
   package random {
     interface IRandom<V> {
       + get_random(): V
     }
     class RandomCollection<V> implements IRandom<V> {
       - values: List<V>
       - weights: List<float>
       + get_random(): V
     }
     class RandomTable<K,V> implements IRandom<V> {
       - rows: Map<K, RandomCollection<V>>
       + get_random(key: K): V
     }
   }
   @enduml

To introduce variability without AI:

- ``IRandom``: An interface for classes that generate random values on demand.

- ``RandomCollection``: Holds a list of values with associated weights. On each draw it can apply a small random perturbation to the weights to simulate noise, then samples one element.

- ``RandomTable``: Maps a key (often an enum, e.g. the ticket type) to a ``RandomCollection`` of values (e.g. the possible priorities for that ticket type). This enables conditional sampling, such as different priority distributions for bugs vs. feature requests (see the sketch below).
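
A compact sketch of how these two classes could be implemented; the noise factor and the example table are illustrative assumptions:

.. code-block:: python

   import random
   from typing import Dict, Generic, List, TypeVar

   K = TypeVar("K")
   V = TypeVar("V")

   class RandomCollection(Generic[V]):
       """Weighted sampling with a small per-draw perturbation of the weights."""

       def __init__(self, values: List[V], weights: List[float], noise: float = 0.05):
           self._values, self._weights, self._noise = values, weights, noise

       def get_random(self) -> V:
           jittered = [max(w + random.uniform(-self._noise, self._noise), 0.0)
                       for w in self._weights]
           return random.choices(self._values, weights=jittered, k=1)[0]

   class RandomTable(Generic[K, V]):
       """Conditional sampling: one RandomCollection per key."""

       def __init__(self, rows: Dict[K, RandomCollection[V]]):
           self._rows = rows

       def get_random(self, key: K) -> V:
           return self._rows[key].get_random()

   priorities = RandomTable({
       "bug":     RandomCollection(["low", "high", "urgent"], [0.2, 0.5, 0.3]),
       "feature": RandomCollection(["low", "high", "urgent"], [0.6, 0.3, 0.1]),
   })
   print(priorities.get_random("bug"))    # e.g. "high"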

Cost & Usage Analysis
---------------------

.. code-block:: text

   @startuml
   package analysis {
     interface Analyzer {
       + get_cost_analysis(): AnalysisResult
     }
     class AssistantRun implements Analyzer {
       + get_cost_analysis(): AnalysisResult
     }
     class AssistantRunsAnalyzer implements Analyzer {
       + get_cost_analysis(): AnalysisResult
     }
     class DatasetUsageAnalyzer implements Analyzer {
       + generate_cost_summary(): List<FormattedAnalysis>
     }
     class AnalysisResult {
       + prompt_tokens: int
       + prompts_cost: Money
       + completion_tokens: int
       + completions_cost: Money
       + total_cost: Money
     }
   }
   @enduml

The system tracks token counts and cost throughout:

- ``AssistantRun``: Records the token usage and cost of a single assistant call (e.g. ``run_openai`` or an equivalent) for one node.

- ``AssistantRunsAnalyzer``: Aggregates multiple ``AssistantRun`` records (e.g. from one pipeline run) into an ``AnalysisResult`` summarizing total tokens and cost.

- ``DatasetUsageAnalyzer``: Gathers cost information across all generated tickets and produces a formatted summary (a per-assistant breakdown plus totals, in any currency). It can emit reports such as CSV files or tables (see the sketch below).
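
A rough sketch of the accounting involved; the per-1K-token prices and the helper functions are illustrative assumptions, not the project's real rates or API:

.. code-block:: python

   from dataclasses import dataclass
   from typing import List

   @dataclass
   class AnalysisResult:
       prompt_tokens: int = 0
       completion_tokens: int = 0
       prompts_cost: float = 0.0       # Money modelled as a plain float for brevity
       completions_cost: float = 0.0

       @property
       def total_cost(self) -> float:
           return self.prompts_cost + self.completions_cost

   # Illustrative per-1K-token prices; real rates depend on the model and provider.
   PRICES = {"gpt-4": (0.03, 0.06), "qwen-7b": (0.0005, 0.0005)}

   def analyze_run(model: str, prompt_tokens: int, completion_tokens: int) -> AnalysisResult:
       """Roughly what a single AssistantRun would record."""
       prompt_rate, completion_rate = PRICES[model]
       return AnalysisResult(
           prompt_tokens, completion_tokens,
           prompt_tokens / 1000 * prompt_rate,
           completion_tokens / 1000 * completion_rate,
       )

   def aggregate(runs: List[AnalysisResult]) -> AnalysisResult:
       """Roughly what AssistantRunsAnalyzer does across one pipeline run."""
       total = AnalysisResult()
       for run in runs:
           total.prompt_tokens += run.prompt_tokens
           total.completion_tokens += run.completion_tokens
           total.prompts_cost += run.prompts_cost
           total.completions_cost += run.completions_cost
       return total

   runs = [analyze_run("gpt-4", 1200, 300), analyze_run("qwen-7b", 800, 400)]
   print(f"total: ${aggregate(runs).total_cost:.4f}")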

These analyzers let you audit the monetary cost of generation per assistant. See the Usage section for examples of generating cost reports.