Evaluate models, prompts, and agents

You can evaluate models, prompts, and agents by comparing their outputs to ground truth data and computing evaluation metrics. Foundry Toolkit streamlines this process: upload datasets and run comprehensive evaluations with minimal effort.

Evaluate prompts and agents

You can evaluate prompts and agents in Agent Builder by selecting the Evaluation tab. Before you evaluate, run your prompts or agents against a dataset.

To evaluate prompts or agents:

  1. In Agent Builder, select the Evaluation tab.
  2. Add and run the dataset you want to evaluate.
  3. Use the thumbs up and down icons to rate responses and keep a record of your manual evaluation.
  4. To add an evaluator, select New Evaluation.
  5. Select an evaluator from the list of built-in evaluators, such as F1 score, relevance, coherence, or similarity.
    Note

    Rate limits might apply when using GitHub-hosted models to run the evaluation.

  6. Select a model to use as a judging model for the evaluation, if required.
  7. Select Run Evaluation to start the evaluation job.

Versioning and evaluation comparison

Foundry Toolkit supports versioning of prompts and agents, so you can compare the performance of different versions. When you create a new version, you can run evaluations and compare results with previous versions.

To save a new version of a prompt or agent:

  1. In Agent Builder, define the system or user prompt, and add variables and tools.
  2. Run the agent or switch to the Evaluate tab and add a dataset to evaluate.
  3. When you are satisfied with the prompt or agent, select Save as New Version from the toolbar.
  4. Optionally, provide a version name and press Enter.

View version history

You can view the version history of a prompt or agent in Agent Builder. The version history shows all versions, along with evaluation results for each version.

In version history view, you can:

  • Select the pencil icon next to the version name to rename a version.
  • Select the trash icon to delete a version.
  • Select a version name to switch to that version.

Compare evaluation results between versions

You can compare evaluation results of different versions in Agent Builder. Results are displayed in a table, showing scores for each evaluator and the overall score for each version.

To compare evaluation results between versions:

  1. In Agent Builder, select the Evaluation tab.
  2. From the evaluation toolbar, select Compare.
  3. Choose the version you want to compare with from the list.
    Note

    Compare functionality is only available in full screen mode of Agent Builder for better visibility of the evaluation results. You can expand the Prompt section to see the model and prompt details.

  4. The evaluation results for the selected version are displayed in a table, allowing you to compare the scores for each evaluator and the overall score for each version.

Built-in evaluators

Foundry Toolkit provides a set of built-in evaluators to measure the performance of your models, prompts, and agents. These evaluators compute various metrics based on your model outputs and ground truth data.

For agents:

  • Intent Resolution: Measures how accurately the agent identifies and addresses user intentions.
  • Task Adherence: Measures how well the agent follows through on identified tasks.
  • Tool Call Accuracy: Measures how well the agent selects and calls the correct tools.

For general purposes:

  • Coherence: Measures logical consistency and flow of responses.
  • Fluency: Measures natural language quality and readability.

For RAG (Retrieval Augmented Generation):

  • Retrieval: Measures how effectively the system retrieves relevant information.

For textual similarity:

  • Similarity: AI-assisted textual similarity measurement.
  • F1 Score: Harmonic mean of precision and recall in token overlaps between response and ground truth (see the sketch after this list).
  • BLEU: Bilingual Evaluation Understudy score for translation quality; measures overlaps in n-grams between response and ground truth.
  • GLEU: Google-BLEU variant for sentence-level assessment; measures overlaps in n-grams between response and ground truth.
  • METEOR: Metric for Evaluation of Translation with Explicit Ordering; measures overlaps in n-grams between response and ground truth.
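
For illustration, the token-overlap F1 above can be computed with a few lines of Python. This is not the Toolkit's implementation, just the underlying arithmetic on a naive whitespace tokenization:

from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    # Naive whitespace tokenization; real evaluators normalize more carefully.
    resp_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    overlap = sum((Counter(resp_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)  # harmonic mean

print(token_f1("Paris is the capital of France", "The capital of France is Paris"))  # 1.0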

The evaluators in Foundry Toolkit are based on the Azure Evaluation SDK. To learn more about observability for generative AI models, see the Microsoft Foundry documentation.

Start a standalone evaluation job

  1. In the Foundry Toolkit view, select TOOLS > Evaluation to open the Evaluation view.

  2. Select Create Evaluation, then provide the following information:

    • Evaluation job name: Use the default or enter a custom name.
    • Evaluator: Select from built-in or custom evaluators.
    • Judging model: Select a model to use as the judging model, if required.
    • Dataset: Select a sample dataset for learning, or import a JSONL file with the fields query, response, and ground truth (see the example after these steps).
  3. A new evaluation job is created. You are prompted to open the evaluation job details.

  4. Verify your dataset and select Run Evaluation to start the evaluation.
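
If you import your own JSONL file, each line is one JSON record. The exact field names depend on the evaluators you select; as an assumed example using the fields mentioned above:

{"query": "What is the capital of France?", "response": "Paris is the capital of France.", "ground_truth": "Paris"}
{"query": "Who wrote Hamlet?", "response": "Hamlet was written by William Shakespeare.", "ground_truth": "William Shakespeare"}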

Monitor the evaluation job

After you start an evaluation job, you can view its status in the evaluation job view.

Each evaluation job includes a link to the dataset used, logs from the evaluation process, a timestamp, and a link to the evaluation details.

Find results of evaluation

The evaluation job details view shows a table of results for each selected evaluator. Some results might include aggregate values.

You can also select Open In Data Wrangler to open the data with the Data Wrangler extension.

Create custom evaluators

You can create custom evaluators to extend the built-in evaluation capabilities of Foundry Toolkit. Custom evaluators let you define your own evaluation logic and metrics.

To create a custom evaluator:

  1. In the Evaluation view, select the Evaluators tab.

  2. Select Create Evaluator to open the creation form.

  3. Provide the required information:

    • Name: Enter a name for your custom evaluator.
    • Description: Describe what the evaluator does.
    • Type: Select the evaluator type, either LLM-based or Code-based (Python).
  4. Follow the instructions for the selected type to complete the setup.

  5. Select Save to create the custom evaluator.

  6. After you create the custom evaluator, it appears in the list of evaluators for selection when you create a new evaluation job.

LLM-based evaluator

For LLM-based evaluators, define the evaluation logic using a natural language prompt.

Write a prompt to guide the evaluator in assessing specific qualities. Define criteria, provide examples, and use variables (for example, placeholders for the query and response) for flexibility. Customize the scale or feedback style as needed.

Make sure the LLM outputs a JSON result, for example: {"score": 4, "reason": "The response is relevant but lacks detail."}
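
As a rough, hypothetical starting point, a relevance evaluator prompt might look like the following; adapt the criteria and scale to your scenario:

You are an evaluator. Given a user query and a model response, rate how relevant the response is to the query on a scale from 1 (off-topic) to 5 (fully relevant). Consider whether the response addresses the question directly and completely. Return only a JSON object in the form {"score": <1-5>, "reason": "<one-sentence justification>"}.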

You can also use the Examples section to get started with your LLM-based evaluator.

Code-based evaluator

For code-based evaluators, define the evaluation logic using Python code. The code should return a JSON result with the evaluation score and reason.

Foundry Toolkit provides a scaffold based on your evaluator name and whether you use an external library.
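
The scaffold's exact shape depends on the options you choose. As a minimal illustrative sketch (the function name and inputs here are assumptions, not the generated signature), a code-based evaluator could look like this:

def evaluate(query: str, response: str, ground_truth: str) -> dict:
    # Toy exact-match check; replace with your own logic (regex rules,
    # library calls, external lookups, and so on).
    match = response.strip().lower() == ground_truth.strip().lower()
    return {
        "score": 5 if match else 1,
        "reason": "Response matches the ground truth." if match else "Response differs from the ground truth.",
    }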

You can modify the code to implement your evaluation logic. To set up the environment and run the evaluator locally:

  • Create a Python virtual environment and install the dependencies:

    python3 -m venv .venv && source .venv/bin/activate && pip install uv && uv pip install -r requirements.txt --prerelease=allow
  • Select the Python environment in VS Code. Open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P), run Python: Select Interpreter, and select the newly created environment.

  • Open the .env file and verify your configuration. It is pre-configured with the connection information to your agent in Foundry.

  • Open the data.jsonl file. This contains example data in the JSONL format. You should modify this data, possibly adding more properties and values depending on the type of evaluator you selected. For example, some evaluators may need a combination of query, response, context, ground_truth, or other properties. You could add your own custom properties and handle them in your test harness logic (a sketch of such a harness follows this list).

  • Open the Testing panel in VS Code (select the flask icon in the Activity Bar).

  • Select the play button next to the testing code to run all tests.

  • View results in the Test Results panel.
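
The generated test harness varies, but conceptually it loads data.jsonl, calls your evaluator for each record, and checks the result. A hypothetical pytest-style sketch (the module and field names are assumptions):

import json
from my_evaluator import evaluate  # hypothetical module produced by the scaffold

def test_evaluator_returns_score_and_reason():
    with open("data.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            result = evaluate(record["query"], record["response"], record["ground_truth"])
            # Every result must carry both fields expected by the evaluation job.
            assert "score" in result and "reason" in result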

What you learned

In this article, you learned how to:

  • Create and run evaluation jobs in Foundry Toolkit for VS Code.
  • Monitor the status of evaluation jobs and view their results.
  • Compare evaluation results between different versions of prompts and agents.
  • View version history for prompts and agents.
  • Use built-in evaluators to measure performance with various metrics.
  • Create custom evaluators to extend the built-in evaluation capabilities.
  • Use LLM-based and code-based evaluators for different evaluation scenarios.