Robustness Tests

The tests/robustness/ framework checks that agent outputs stay consistent when the same request is phrased in different ways.

Running Tests

Standalone Mode (Recommended)

# Run as script (shows detailed output)
uv run python tests/robustness/test_chembl_interactivity.py

# With timeout
timeout 600 uv run python tests/robustness/test_chembl_interactivity.py

Pytest Mode

uv run pytest tests/robustness/ -v

# Run specific test file
uv run pytest tests/robustness/test_chembl_interactivity.py -v

Warning

Pytest mode may run out of memory, in which case the process is killed and exits with code 137. If this happens, use standalone mode instead.

Config-Driven Mode

# Run a specific test with a custom number of prompt variations
uv run python tests/robustness/robustness_minimal_example.py \
  --test chembl_download --n-variations 3

# List available tests
uv run python tests/robustness/robustness_minimal_example.py --list-tests

Session Isolation

Each prompt variation runs in complete isolation:

Fresh Agent Teams: Each variation creates a new agent team from scratch
Disabled Memory: Agent memory is disabled (enable_memory=False)
Isolated S3 Storage: Each variation gets a unique S3 prefix
No State Leakage: Agents cannot see results from previous variations

This ensures tests measure robustness to prompt variation, not side effects from memory or shared state.
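The isolation rules above can be sketched as follows. This is only an illustration, not the framework's actual code: `VariationSession`, `make_session`, and the `robustness/<test>/<run id>/` prefix layout are hypothetical names chosen for the example. Only the `enable_memory=False` setting comes from the list above.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class VariationSession:
    """Hypothetical per-variation settings; not part of the real framework."""

    variation_index: int
    s3_prefix: str               # unique storage location for this variation
    enable_memory: bool = False  # agent memory stays off, as described above


def make_session(test_name: str, variation_index: int) -> VariationSession:
    # A random run id keeps prefixes unique, so one variation can never
    # read artifacts written by another (assumed prefix layout).
    run_id = uuid.uuid4().hex
    prefix = f"robustness/{test_name}/{run_id}/"
    return VariationSession(variation_index=variation_index, s3_prefix=prefix)


# One fresh session per prompt variation; nothing is shared between them.
sessions = [make_session("chembl_download", i) for i in range(3)]
for session in sessions:
    print(session.variation_index, session.s3_prefix, session.enable_memory)
```

Because every session is built from scratch and is immutable, a later variation has no handle on an earlier one's prefix or memory.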

Robustness Score

Score = 0.4 × Data + 0.3 × Semantic + 0.2 × Process + 0.1 × Visual

Rating	Score range
Excellent	>= 0.90
Good	0.80 to < 0.90
Acceptable	0.70 to < 0.80
Concerning	< 0.70
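As a worked example, the formula and rating thresholds above can be written as a short Python sketch. The function and key names here (`robustness_score`, `rating`, `"data"`, `"semantic"`, and so on) are illustrative and are not taken from metrics.py, and the sketch assumes each component score lies in the range 0 to 1.

```python
# Weights copied from the formula above; the dictionary keys are illustrative.
WEIGHTS = {"data": 0.4, "semantic": 0.3, "process": 0.2, "visual": 0.1}


def robustness_score(components: dict[str, float]) -> float:
    """Weighted sum of component scores (each assumed to be in [0, 1])."""
    for name in WEIGHTS:
        value = components[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def rating(score: float) -> str:
    """Map a score to the rating labels in the table above."""
    if score >= 0.90:
        return "Excellent"
    if score >= 0.80:
        return "Good"
    if score >= 0.70:
        return "Acceptable"
    return "Concerning"


example = {"data": 0.9, "semantic": 0.8, "process": 0.7, "visual": 1.0}
score = robustness_score(example)
# 0.4*0.9 + 0.3*0.8 + 0.2*0.7 + 0.1*1.0 = 0.84
print(round(score, 2), rating(score))  # 0.84 Good
```

Since the weights sum to 1, the data component alone accounts for 40% of the final score, so a drop in data consistency moves the rating more than an equal drop in any other component.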

Framework Structure

tests/robustness/
├── test_chembl_interactivity.py       # ChEMBL clarification flow tests
├── test_pipeline_robustness.py        # Full pipeline robustness tests
├── test_autoencoder_robustness.py     # Autoencoder operation tests
├── robustness_minimal_example.py      # Config-driven test runner
├── conftest.py                        # Shared pytest fixtures
├── test_utils.py                      # Core utilities
├── tool_tracker.py                    # Tool sequence tracking
├── config_schema.py                   # Configuration validation
├── prompt_variations.py               # Prompt variation generator
├── comparators.py                     # Output comparison utilities
├── metrics.py                         # Robustness scoring
├── robustness_config.yaml             # Test configuration
└── fixtures/
    └── prompt_templates.yaml          # Prompt variations database

Troubleshooting

Issue	Solution
Tests hang	Prefix the command with `timeout 300`
Exit code 137 (OOM)	Use standalone mode instead of pytest
"Database is locked"	Remove the `.agno/` directory
MinIO not accessible	Set `USE_S3=false`
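Where the `timeout` command is unavailable, the first two rows can be approximated from Python. The helper below is a minimal sketch, not part of the framework: `run_with_timeout` is a hypothetical name. It relies on the standard library's `subprocess.run`, which raises `subprocess.TimeoutExpired` when the limit is hit. A shell reports a process killed by SIGKILL as exit code 137 (128 + 9), while `subprocess` reports the same event as return code -9, so the sketch checks both.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], seconds: float) -> str:
    """Run cmd, returning a short status string (hypothetical helper)."""
    try:
        result = subprocess.run(cmd, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return "timeout"
    if result.returncode in (137, -9):
        # Killed by SIGKILL, which is what an out-of-memory kill looks like.
        return "killed (possibly out of memory)"
    if result.returncode == 0:
        return "ok"
    return f"failed ({result.returncode})"


# Harmless demonstration using the current interpreter:
print(run_with_timeout([sys.executable, "-c", "pass"], 30))  # ok

# For the standalone test, the call would look like this (not run here):
# run_with_timeout(
#     ["uv", "run", "python", "tests/robustness/test_chembl_interactivity.py"],
#     600,
# )
```

A "killed" status is only a hint, because other causes can also send SIGKILL. If it appears in pytest mode, switching to standalone mode as described above is the first thing to try.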