
[2/x] Support non-OAI providers as LLM judges for GenAI metrics. #13717

Merged
merged 8 commits into mlflow:master on Nov 11, 2024

Conversation

Collaborator

@B-Step62 B-Step62 commented Nov 8, 2024

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

MLflow LLM Evaluation supports the following types of judges: (1) OpenAI models, (2) Databricks Model Serving endpoints, and (3) MLflow Gateway endpoints. This PR adds support for a few more proprietary LLM providers, such as Anthropic, Bedrock, and Mistral, in response to repeated user requests.

# Today, OpenAI is the only supported LLM provider for the judge
answer_correctness("openai:/gpt-4o-mini")
# This PR expands it to other providers
answer_correctness("anthropic:/claude-1.3-100k")

The main challenge here is how to construct the request payload and parse the response for each provider. Fortunately, we already solved this problem once, in MLflow Gateway: its adapter implementations for the different LLM providers translate the vendor-dependent formats into a unified chat format.
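The adapter idea can be sketched roughly as follows. This is an illustrative sketch only; the class and method names mirror the shape of the Gateway adapters discussed in this PR but are not the exact MLflow internals:

```python
# Illustrative sketch of the adapter pattern (hypothetical names,
# not the actual MLflow Gateway classes).

class AnthropicAdapterSketch:
    """Translates a unified chat payload to/from a vendor-specific format."""

    @classmethod
    def chat_to_model(cls, payload: dict, model_name: str) -> dict:
        # Rename unified keys to the vendor-specific ones, then attach the model.
        key_mapping = {"stop": "stop_sequences"}
        translated = {key_mapping.get(k, k): v for k, v in payload.items()}
        translated["model"] = model_name
        return translated

    @classmethod
    def model_to_chat(cls, resp: dict) -> dict:
        # Normalize the vendor response back into the unified chat shape.
        return {
            "choices": [
                {"message": {"role": "assistant", "content": resp["completion"]}}
            ]
        }


payload = {"messages": [{"role": "user", "content": "Hi"}], "stop": ["\n"]}
request = AnthropicAdapterSketch.chat_to_model(payload, "claude-1.3-100k")
```

With this shape, the judge code only ever sees the unified chat format, and each provider's quirks stay inside its adapter.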

A few key notes:

  1. This PR does not add support for all providers available in MLflow Gateway; rather, it starts with a few well-known ones that have chat models.
  2. Half of this PR is minor refactoring / modification of the provider implementations, for example, adding an `adapter` property so we can obtain the adapter class conveniently.
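The `adapter` property mentioned in note 2 could look roughly like this (a hedged sketch with placeholder class bodies, not the exact provider code):

```python
# Hypothetical sketch of the `adapter` property on a provider class
# (class names are illustrative placeholders).

class TogetherAIAdapter:
    """Placeholder adapter class."""


class TogetherAIProvider:
    @property
    def adapter(self):
        # Consumers can fetch the right adapter class from a provider
        # instance without maintaining a separate provider-to-adapter map.
        return TogetherAIAdapter


provider = TogetherAIProvider()
adapter_cls = provider.adapter
```

The design choice is that the provider itself knows which adapter translates its format, so the judge code can stay provider-agnostic.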

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests
Anthropic: (screenshot, 2024-11-08 16:31)
Bedrock: (screenshot, 2024-11-08 16:32)
Mistral: (screenshot, 2024-11-08 16:31)

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Will update LLM evaluation doc in a follow-up PR before the release.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Enhance MLflow GenAI metrics (LLM-as-a-judge) to support more LLM providers than OpenAI, such as Anthropic, Amazon Bedrock, Mistral, etc.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

github-actions bot commented Nov 8, 2024

Documentation preview for becb1e2 will be available when this CircleCI job
completes successfully.


@B-Step62 B-Step62 force-pushed the eval-judge-2-providers branch 3 times, most recently from a397381 to ab3a0e7 Compare November 8, 2024 07:09
Signed-off-by: B-Step62 <[email protected]>
@B-Step62 B-Step62 force-pushed the eval-judge-2-providers branch from ab3a0e7 to f63c601 Compare November 8, 2024 07:20
Signed-off-by: B-Step62 <[email protected]>
@B-Step62 B-Step62 changed the title [2/x; WIP] Support non-OAI providers as LLM judges for GenAI metrics. [2/x] Support non-OAI providers as LLM judges for GenAI metrics. Nov 8, 2024
@B-Step62 B-Step62 marked this pull request as ready for review November 8, 2024 07:24
@github-actions github-actions bot added patch-2.17.3 area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. labels Nov 8, 2024
Signed-off-by: B-Step62 <[email protected]>
@@ -17,6 +17,7 @@ class AnthropicAdapter(ProviderAdapter):
@classmethod
def chat_to_model(cls, payload, config):
key_mapping = {"stop": "stop_sequences"}
payload["model"] = config.model.name
Collaborator Author

note: the "model" key was set in the async handler before (see diff at L261-). Moving it inside the adapter because we want the request transformation logic to be fully contained within the adapter.

self.base_url = "https://api.anthropic.com/v1/"

@property
def base_url(self) -> str:
Collaborator Author

@B-Step62 B-Step62 Nov 8, 2024

note: accessors for convenience and simplicity on the consumer side. Eventually these should be added to the base provider class and implemented in all providers, but keeping them optional in this PR to reduce the size of the changes.

@@ -54,6 +54,36 @@ def model_to_completions(cls, resp, config):
),
)

@classmethod
def model_to_chat(cls, resp, config):
Collaborator Author

note: We only have completions endpoint support for Mistral now. However, their endpoint uses a chat format (OpenAI compatible).
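Because the endpoint is already OpenAI compatible, the translation in `model_to_chat` is mostly a pass-through. A rough sketch of that idea (a hypothetical helper, not the actual Mistral adapter code):

```python
# Hypothetical sketch: an OpenAI-compatible chat response needs little
# translation into the unified chat shape.

def model_to_chat_sketch(resp: dict) -> dict:
    return {
        "choices": [
            {
                "index": choice.get("index", i),
                "message": choice["message"],
                "finish_reason": choice.get("finish_reason"),
            }
            for i, choice in enumerate(resp["choices"])
        ],
        "usage": resp.get("usage", {}),
    }


resp = {"choices": [{"message": {"role": "assistant", "content": "ok"}}]}
unified = model_to_chat_sketch(resp)
```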

Comment on lines 315 to 317
@property
def adapter(self):
return TogetherAIAdapter
Member

Is this an abstract property?

Collaborator Author

Nope, but can do in a follow-up. See #13717 (comment) for why I didn't add it in this PR.

Member

@harupy harupy left a comment

Signed-off-by: B-Step62 <[email protected]>
Signed-off-by: B-Step62 <[email protected]>
Signed-off-by: B-Step62 <[email protected]>
@B-Step62 B-Step62 added this pull request to the merge queue Nov 11, 2024
Merged via the queue into mlflow:master with commit 7d4865b Nov 11, 2024
41 checks passed
@B-Step62 B-Step62 deleted the eval-judge-2-providers branch November 11, 2024 13:15
dsuhinin pushed a commit to dsuhinin/mlflow that referenced this pull request Nov 14, 2024
@pazevedo-hyland
Copy link

@B-Step62 any chance you can add documentation on how to set up the environment variables for this?
I'm not sure how I would go about choosing the region, etc., for the judge when calling Bedrock as the judge.

This doesn't work, for example: (screenshots omitted)

@B-Step62
Copy link
Collaborator Author

@pazevedo-hyland You can set the region using the AWS_REGION environment variable; you don't need to pass it in the parameters. We will add this info to the documentation.

import os

import mlflow

os.environ["AWS_REGION"] = "us-west-2"
os.environ["AWS_ACCESS_KEY_ID"] = "<your_access_key>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your_secret_access_key>"

my_answer_similarity = mlflow.metrics.genai.answer_similarity(
    model="bedrock:/anthropic.claude-3-5-sonnet-20241022-v2:0",
    parameters={
        "temperature": 0,
        "max_tokens": 1000,
        "anthropic_version": "bedrock-2023-05-31"
    },
)

@pazevedo-hyland
Copy link

@B-Step62 yup, that fixes it! Thanks.

Any chance you can add in the docs as well which parameters can be passed for each of the providers?

@B-Step62
Copy link
Collaborator Author

B-Step62 commented Nov 25, 2024

Any chance you can add in the docs as well which parameters can be passed for each of the providers?

Parameters are simply passed through to the provider's model endpoint within the request payload. Which ones are accepted or rejected is not in our control, so we cannot add and maintain documentation for that; even different models from the same provider can have different requirements. The API reference provided by each provider should be a more reliable source for this information 🙂

karthikkurella pushed a commit to karthikkurella/mlflow that referenced this pull request Jan 30, 2025
Labels
area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. v2.17.3 v2.18.0
5 participants