evals
Here are 36 public repositories matching this topic...
AI Observability & Evaluation
Updated Apr 4, 2025 - Jupyter Notebook
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including OpenAI Agents SDK, CrewAI, Langchain, Autogen, AG2, and CamelAI.
Updated Apr 5, 2025 - Python
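A minimal usage sketch for this kind of monitoring SDK, assuming the package is `agentops` and follows the init/end-session pattern from its earlier documentation; names like `end_session` may have changed in newer releases:

```python
# Sketch only: assumes the `agentops` package with the init/end_session
# API from earlier SDK docs; newer releases may rename these calls.
import agentops
from openai import OpenAI

agentops.init(api_key="<AGENTOPS_API_KEY>")  # start a monitored session

client = OpenAI()  # OpenAI calls are auto-instrumented once init() runs
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)

agentops.end_session("Success")  # flush the session: cost, tokens, outcome
```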
The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.
Updated Apr 4, 2025 - Python
Laminar - an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
Updated Apr 4, 2025 - TypeScript
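A hypothetical tracing sketch, assuming the Python package is `lmnr` and exposes `Laminar.initialize` plus an `@observe` decorator as described in its docs:

```python
# Hypothetical sketch: assumes the `lmnr` package with Laminar.initialize
# and an @observe decorator that records each call as a trace span.
from lmnr import Laminar, observe

Laminar.initialize(project_api_key="<LMNR_PROJECT_API_KEY>")

@observe()  # calls to this function show up as traces in the dashboard
def answer(question: str) -> str:
    # stand-in for a real LLM call
    return f"echo: {question}"

print(answer("What is a data flywheel?"))
```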
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite
Updated Apr 3, 2025 - Python
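A sketch of the typical flow, assuming names like `RAGLiteConfig`, `insert_document`, and `hybrid_search` from the project's README; treat these identifiers as unverified:

```python
# Unverified sketch based on the project's README: configure a SQLite
# backend, index a document, then run a hybrid (keyword + vector) search.
from pathlib import Path

from raglite import RAGLiteConfig, hybrid_search, insert_document

config = RAGLiteConfig(db_url="sqlite:///raglite.db")  # or a PostgreSQL URL

insert_document(Path("paper.pdf"), config=config)  # chunk, embed, and store

chunk_ids, scores = hybrid_search("How is intelligence measured?", num_results=5, config=config)
print(chunk_ids, scores)
```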
Test your LLM-powered apps with TypeScript. No API key required.
Updated Apr 4, 2025 - TypeScript
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
Updated Mar 7, 2025 - Jupyter Notebook
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Updated Apr 4, 2025 - TypeScript
Evalica, your favourite evaluation toolkit
Updated Apr 2, 2025 - Python
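For pairwise-comparison evals, the toolkit computes rankings from head-to-head outcomes; a sketch assuming the `elo` function and `Winner` enum shown in its README:

```python
# Sketch assuming evalica's documented elo(xs, ys, winners) signature:
# each (x, y, winner) triple is one pairwise comparison between two items.
from evalica import Winner, elo

xs = ["pizza", "burger", "pizza"]
ys = ["burger", "sushi", "sushi"]
winners = [Winner.X, Winner.Y, Winner.Draw]

result = elo(xs, ys, winners)
print(result.scores)  # Elo rating per item; highest = strongest
```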
Benchmarking Large Language Models for FHIR
Updated Nov 29, 2024
An implementation of Anthropic's paper and essay, "A statistical approach to model evaluations"
Updated Feb 27, 2025 - Python
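The paper's central recommendation is to treat an eval score as a sample mean and report it with a standard error; a self-contained illustration of that idea (not this repository's API):

```python
# Illustration of the paper's core idea, not this repository's API:
# an eval score over n questions is a sample mean, so report it with a
# CLT-based standard error and a 95% confidence interval.
import math

def mean_and_sem(scores: list[float]) -> tuple[float, float]:
    """Mean score and its standard error, assuming i.i.d. questions."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)

scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # per-question pass/fail
mean, sem = mean_and_sem(scores)
print(f"score = {mean:.2f} +/- {1.96 * sem:.2f} (95% CI)")
```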
Root Signals Python SDK
Updated Mar 31, 2025 - Python
MCP for Root Signals Evaluation Platform
Updated Apr 3, 2025 - Python