Is there any playground or equivalent for seeing the tool use / agentic performance of different models? I want to essentially give it 5 tools and my usual prompt, have the playground generate ~5-10 cases and run all the selected models, and then observe and change the model and prompt.
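Roughly the loop I have in mind, as a minimal sketch rather than a real tool. Everything here is hypothetical: the "models" are stub functions standing in for actual API calls, the cases are hand-written instead of generated, and the only thing scored is whether the expected tool was picked.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a "model" is any callable that takes (prompt, user_input, tool_names)
# and returns the name of the tool it chose. A real setup would wrap an API call here.
Model = Callable[[str, str, list[str]], str]


@dataclass
class Case:
    user_input: str
    expected_tool: str


def run_matrix(models: dict[str, Model], prompt: str,
               tools: list[str], cases: list[Case]) -> dict[str, float]:
    """Run every model on every case; return the fraction of correct tool picks per model."""
    scores = {}
    for name, model in models.items():
        hits = sum(model(prompt, c.user_input, tools) == c.expected_tool for c in cases)
        scores[name] = hits / len(cases)
    return scores


# Two stub "models" so the sketch runs without any network access.
def always_search(prompt: str, user_input: str, tools: list[str]) -> str:
    return "search"


def keyword_router(prompt: str, user_input: str, tools: list[str]) -> str:
    # Picks the first tool whose name appears in the input, else the first tool.
    for tool in tools:
        if tool in user_input.lower():
            return tool
    return tools[0]


TOOLS = ["search", "calculator", "calendar", "email", "weather"]
CASES = [
    Case("search for flights to Oslo", "search"),
    Case("open the calculator and add 2 and 2", "calculator"),
    Case("check my calendar for Friday", "calendar"),
    Case("send an email to Sam", "email"),
    Case("what's the weather tomorrow", "weather"),
]

scores = run_matrix(
    {"always_search": always_search, "keyword_router": keyword_router},
    "You are a helpful assistant.",
    TOOLS,
    CASES,
)
for name, score in scores.items():
    print(f"{name}: {score:.0%}")
```

The part I'd want the playground to handle is swapping the stubs for real models and generating the cases from the tools and prompt, so that changing the model or prompt is a one-line edit followed by a rerun.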