
[2/x] Support non-OAI providers as LLM judges for GenAI metrics. #13717

Merged
merged 8 commits into mlflow:master on Nov 11, 2024

Conversation

Collaborator

@B-Step62 B-Step62 commented Nov 8, 2024

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

MLflow LLM Evaluation supports the following types of judges: (1) OpenAI models, (2) Databricks Model Serving endpoints, and (3) MLflow Gateway endpoints. This PR adds support for a few more proprietary LLM providers, such as Anthropic, Bedrock, and Mistral, in response to repeated user requests.

# Today, OpenAI is the only supported LLM provider for the judge
answer_correctness("openai:/gpt-4o-mini")
# This PR expands it to other providers
answer_correctness("anthropic:/claude-1.3-100k")

The main challenge here is how to construct the request payload and parse the response for each provider. Fortunately, we already solved this problem once, in MLflow Gateway: its adapter implementations for the different LLM providers translate the vendor-dependent formats into a unified chat format.
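The adapter idea can be sketched roughly as follows. This is an illustrative sketch only; the class and method names mirror the shape of the Gateway adapters discussed in this PR but are not the exact MLflow internals:

```python
# Illustrative sketch of the adapter pattern (hypothetical names,
# not the actual MLflow Gateway classes).

class AnthropicAdapterSketch:
    """Translates a unified chat payload to/from a vendor-specific format."""

    @classmethod
    def chat_to_model(cls, payload: dict, model_name: str) -> dict:
        # Rename unified keys to the vendor-specific ones, then attach the model.
        key_mapping = {"stop": "stop_sequences"}
        translated = {key_mapping.get(k, k): v for k, v in payload.items()}
        translated["model"] = model_name
        return translated

    @classmethod
    def model_to_chat(cls, resp: dict) -> dict:
        # Normalize the vendor response back into the unified chat shape.
        return {
            "choices": [
                {"message": {"role": "assistant", "content": resp["completion"]}}
            ]
        }


payload = {"messages": [{"role": "user", "content": "Hi"}], "stop": ["\n"]}
request = AnthropicAdapterSketch.chat_to_model(payload, "claude-1.3-100k")
```

With this shape, the judge code only ever sees the unified chat format, and each provider's quirks stay inside its adapter.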

A few key notes:

  1. This PR does not add support for all providers available in MLflow Gateway; rather, it starts with a few well-known ones that have chat models.
  2. Half of this PR is minor refactoring / modification of the provider implementations, for example, adding an `adapter` property so we can obtain the adapter class conveniently.
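The `adapter` property mentioned in note 2 could look roughly like this (a hedged sketch with placeholder class bodies, not the exact provider code):

```python
# Hypothetical sketch of the `adapter` property on a provider class
# (class names are illustrative placeholders).

class TogetherAIAdapter:
    """Placeholder adapter class."""


class TogetherAIProvider:
    @property
    def adapter(self):
        # Consumers can fetch the right adapter class from a provider
        # instance without maintaining a separate provider-to-adapter map.
        return TogetherAIAdapter


provider = TogetherAIProvider()
adapter_cls = provider.adapter
```

The design choice is that the provider itself knows which adapter translates its format, so the judge code can stay provider-agnostic.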

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests
Anthropic: (screenshot, 2024-11-08 16:31)
Bedrock: (screenshot, 2024-11-08 16:32)
Mistral: (screenshot, 2024-11-08 16:31)

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Will update LLM evaluation doc in a follow-up PR before the release.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Enhance MLflow GenAI metrics (LLM-as-a-judge) to support more LLM providers than OpenAI, such as Anthropic, Amazon Bedrock, Mistral, etc.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

github-actions bot commented Nov 8, 2024

Documentation preview for becb1e2 will be available when this CircleCI job
completes successfully.


@B-Step62 B-Step62 force-pushed the eval-judge-2-providers branch 3 times, most recently from a397381 to ab3a0e7 Compare November 8, 2024 07:09
Signed-off-by: B-Step62 <[email protected]>
@B-Step62 B-Step62 force-pushed the eval-judge-2-providers branch from ab3a0e7 to f63c601 Compare November 8, 2024 07:20
Signed-off-by: B-Step62 <[email protected]>
@B-Step62 B-Step62 changed the title [2/x; WIP] Support non-OAI providers as LLM judges for GenAI metrics. [2/x] Support non-OAI providers as LLM judges for GenAI metrics. Nov 8, 2024
@B-Step62 B-Step62 marked this pull request as ready for review November 8, 2024 07:24
@github-actions github-actions bot added patch-2.17.3 area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. labels Nov 8, 2024
Signed-off-by: B-Step62 <[email protected]>
@@ -17,6 +17,7 @@ class AnthropicAdapter(ProviderAdapter):
@classmethod
def chat_to_model(cls, payload, config):
key_mapping = {"stop": "stop_sequences"}
payload["model"] = config.model.name
Collaborator Author

note: the "model" key was set in the async handler before (see diff at L261-). Moving it inside the adapter because we want the request transformation logic to be fully contained within the adapter.

self.base_url = "https://api.anthropic.com/v1/"

@property
def base_url(self) -> str:
Collaborator Author

@B-Step62 B-Step62 Nov 8, 2024

note: accessors for convenience and simplicity on the consumer side. Eventually these should be added to the base provider class and implemented in all providers, but keeping them optional in this PR to reduce the size of the changes.

@@ -54,6 +54,36 @@ def model_to_completions(cls, resp, config):
),
)

@classmethod
def model_to_chat(cls, resp, config):
Collaborator Author

note: We only have completions endpoint support for Mistral now. However, their endpoint uses a chat format (OpenAI compatible).
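Because the endpoint is already OpenAI compatible, the translation in `model_to_chat` is mostly a pass-through. A rough sketch of that idea (a hypothetical helper, not the actual Mistral adapter code):

```python
# Hypothetical sketch: an OpenAI-compatible chat response needs little
# translation into the unified chat shape.

def model_to_chat_sketch(resp: dict) -> dict:
    return {
        "choices": [
            {
                "index": choice.get("index", i),
                "message": choice["message"],
                "finish_reason": choice.get("finish_reason"),
            }
            for i, choice in enumerate(resp["choices"])
        ],
        "usage": resp.get("usage", {}),
    }


resp = {"choices": [{"message": {"role": "assistant", "content": "ok"}}]}
unified = model_to_chat_sketch(resp)
```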

Comment on lines 315 to 317
@property
def adapter(self):
return TogetherAIAdapter
Member

Is this an abstract property?

Collaborator Author

Nope, but can do in a follow-up. See #13717 (comment) for why I didn't add it in this PR.

Member

@harupy harupy left a comment

Signed-off-by: B-Step62 <[email protected]>
Signed-off-by: B-Step62 <[email protected]>
Signed-off-by: B-Step62 <[email protected]>
@B-Step62 B-Step62 added this pull request to the merge queue Nov 11, 2024
Merged via the queue into mlflow:master with commit 7d4865b Nov 11, 2024
41 checks passed
@B-Step62 B-Step62 deleted the eval-judge-2-providers branch November 11, 2024 13:15
dsuhinin pushed a commit to dsuhinin/mlflow that referenced this pull request Nov 14, 2024
@pazevedo-hyland
Copy link

@B-Step62 any chance you can add documentation on how to set up the environment variables for this?
I'm not sure how I would go about choosing the region, etc., for the judge when calling Bedrock as the judge.

This doesn't work, for example: (screenshots omitted)

@B-Step62
Copy link
Collaborator Author

@pazevedo-hyland You can set the region using the AWS_REGION environment variable; you don't need to pass it in the parameters. We will add this info to the documentation.

import os

import mlflow

os.environ["AWS_REGION"] = "us-west-2"
os.environ["AWS_ACCESS_KEY_ID"] = "<your_access_key>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your_secret_access_key>"

my_answer_similarity = mlflow.metrics.genai.answer_similarity(
    model="bedrock:/anthropic.claude-3-5-sonnet-20241022-v2:0",
    parameters={
        "temperature": 0,
        "max_tokens": 1000,
        "anthropic_version": "bedrock-2023-05-31"
    },
)

@pazevedo-hyland
Copy link

@B-Step62 yup, that fixes it! Thanks.

Any chance you can add in the docs as well which parameters can be passed for each of the providers?

@B-Step62
Copy link
Collaborator Author

B-Step62 commented Nov 25, 2024

Any chance you can add in the docs as well which parameters can be passed for each of the providers?

Parameters are simply passed through to the provider's model endpoint within the request payload. Which ones are accepted or rejected is not in our control, so we cannot add and maintain documentation for that; even different models from the same provider can have different requirements. The API reference provided by each provider should be a more reliable source for this information 🙂

karthikkurella pushed a commit to karthikkurella/mlflow that referenced this pull request Jan 30, 2025
Labels
area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. v2.17.3 v2.18.0
5 participants