🧪 Evaluation Framework

The evaluation framework is separate from the unit tests. It performs visual regression testing by rendering scenes to PNGs and comparing them against golden (expected) images.

πŸ“ Structure

evaluation/
β”œβ”€β”€ cases/      # Scene input JSON files
β”œβ”€β”€ expected/   # Golden PNG images
β”œβ”€β”€ output/     # Rendered PNGs (generated)
└── diffs/      # Diff images on failure

Browse on GitHub: evaluation/

🚀 Running Evaluation

uv run talk2scene eval.run=true

πŸ” How It Works

flowchart TD
    A[evaluation/cases/*.json] --> B[Render scene to PNG]
    B --> C{Compare with\nexpected PNG}
    C -->|within tolerance| D[Pass]
    C -->|exceeds tolerance| E[Write diff image]
    D --> F[JSON report\n+ text summary]
    E --> F
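The sketch below shows this loop in minimal Python. The `render` and `compare` callables, the `report.json` filename, and the default tolerance are assumptions standing in for the built-in runner's internals, not its actual implementation:

```python
import json
from pathlib import Path
from typing import Callable


def run_evaluation(
    render: Callable[[dict, Path], None],    # hypothetical: renders a scene dict to a PNG path
    compare: Callable[[Path, Path], float],  # hypothetical: returns fraction of differing pixels
    root: Path = Path("evaluation"),
    tolerance: float = 0.01,                 # assumed default; the actual tolerance is configurable
) -> list[dict]:
    """Render every case, compare it with its golden PNG, and collect pass/fail results."""
    results = []
    for case in sorted((root / "cases").glob("*.json")):
        scene = json.loads(case.read_text())
        rendered = root / "output" / f"{case.stem}.png"
        render(scene, rendered)

        diff = compare(rendered, root / "expected" / f"{case.stem}.png")
        # On failure the real runner also writes a diff image to root / "diffs".
        results.append({"case": case.stem, "diff": diff, "passed": diff <= tolerance})

    # JSON report plus a one-line text summary.
    (root / "report.json").write_text(json.dumps(results, indent=2))
    print(f"{sum(r['passed'] for r in results)}/{len(results)} cases passed")
    return results
```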

πŸ“ Comparison Methods

  • πŸ‘οΈ Pixel diff: Percentage of differing pixels (configurable tolerance)
  • #️⃣ Perceptual hash: Hamming distance between image hashes
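To make the two metrics concrete, here is a hedged sketch using Pillow, NumPy, and the third-party imagehash package; the runner's actual thresholds and implementation may differ:

```python
import imagehash  # perceptual hashing (pip install imagehash); used here for illustration
import numpy as np
from PIL import Image


def pixel_diff_ratio(rendered: str, expected: str) -> float:
    """Fraction of pixels where any RGB channel differs between two same-sized images."""
    a = np.asarray(Image.open(rendered).convert("RGB"))
    b = np.asarray(Image.open(expected).convert("RGB"))
    return float(np.any(a != b, axis=-1).mean())


def phash_distance(rendered: str, expected: str) -> int:
    """Hamming distance between the perceptual hashes of two images (0 = identical hashes)."""
    return imagehash.phash(Image.open(rendered)) - imagehash.phash(Image.open(expected))
```

A case would then pass when `pixel_diff_ratio(...)` stays within the configured tolerance, or when `phash_distance(...)` falls under a small Hamming-distance threshold.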

🆚 Tests vs Evaluation

|              | tests/            | evaluation/         |
|--------------|-------------------|---------------------|
| 🏷️ Type      | Unit tests        | Visual regression   |
| 🛠️ Tool      | pytest            | Built-in runner     |
| ✅ Checks    | Logic correctness | Render correctness  |
| 📦 Artifacts | -                 | PNG renders + diffs |