realistic and structured evaluation scenarios for LLM-based agents. You'll create test cases that simulate human-performed tasks... and define the gold-standard behavior against which agent actions are compared. You'll ensure each scenario is clearly defined, well...