This application is a tool to assist with ranking large language models (LLMs) for use in the CHIM Skyrim mod. It allows you to:
- Use exported context logs from CHIM to generate responses from an LLM
- Get a different LLM to score those responses
- View the average scores in a table
See the CHIM Discord for more details.
Run the application using the `chim_scoring_assistant.exe` file. Refer to the application's "Instructions" tab for more detailed guidance.
The application remembers your LLM settings and the prompts file when you click the "Generate" button, so you shouldn't need to re-enter this information the next time you use the application.
Only the average scores are displayed within the application. However, all LLM responses and score explanations are saved in the "models" folder for your reference.
LLM responses are evaluated on five criteria, each on a scale of 1 to 5. The specific prompts used for scoring are defined in the `scorePrompt.txt` file.
The five criteria are:
- Responsiveness: How well the LLM's response addressed the player's prompt.
- Roleplaying: How effectively the LLM roleplayed as its assigned character.
- Quality: Overall quality of the response, considering humor, wit, depth, originality, and appropriateness.
- Action: How suitable the selected action was in the context of the scene.
- Target: How well the AI selected a target for its action (if applicable).
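For reference, a response's scores can be thought of as a simple record with one field per criterion. The TypeScript sketch below is illustrative only — the field names and the averaging helper are assumptions, not the application's actual code — but it shows how the per-criterion averages displayed in the table could be computed:

```typescript
// Hypothetical per-response score record; field names are illustrative.
interface ResponseScores {
  responsiveness: number; // 1-5: how well the player's prompt was addressed
  roleplaying: number;    // 1-5: fidelity to the assigned character
  quality: number;        // 1-5: humor, wit, depth, originality, appropriateness
  action: number;         // 1-5: suitability of the selected action
  target: number;         // 1-5: correctness of the action's target, if any
}

// Average each criterion across all scored responses (what the table displays).
function averageScores(all: ResponseScores[]): ResponseScores {
  const avg = (key: keyof ResponseScores) =>
    all.reduce((sum, s) => sum + s[key], 0) / all.length;
  return {
    responsiveness: avg("responsiveness"),
    roleplaying: avg("roleplaying"),
    quality: avg("quality"),
    action: avg("action"),
    target: avg("target"),
  };
}
```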
The application makes some manual adjustments to the scores (sketched below):
- All scores are overridden with a 1 if the response is not valid JSON, as a non-JSON response is unusable in CHIM.
- The AI judges the suitability of the selected action, but the application overrides the action score with a 1 if the action is unrecognized.
- The target score is calculated without AI input. If the selected action requires a target (like "Attack") but no target is specified, the target score is set to 1. Slightly incorrect targets get a 4, while correctly set targets get a 5.
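A rough sketch of these overrides in TypeScript, reusing the `ResponseScores` shape from the sketch above. The action lists and parsing are assumptions for illustration, and the 4-vs-5 target grading (comparing against the expected target) is omitted:

```typescript
// Hypothetical action lists; the real application checks CHIM's known actions.
const KNOWN_ACTIONS = new Set(["Attack", "Talk"]); // example entries
const NEEDS_TARGET = new Set(["Attack"]);          // actions that require a target

function adjustScores(rawResponse: string, aiScores: ResponseScores): ResponseScores {
  let parsed: { action?: string; target?: string };
  try {
    parsed = JSON.parse(rawResponse);
  } catch {
    // Non-JSON responses are unusable in CHIM: every criterion becomes 1.
    return { responsiveness: 1, roleplaying: 1, quality: 1, action: 1, target: 1 };
  }
  const scores = { ...aiScores };
  if (parsed.action && !KNOWN_ACTIONS.has(parsed.action)) {
    scores.action = 1; // unrecognized action overrides the AI's judgment
  }
  if (parsed.action && NEEDS_TARGET.has(parsed.action) && !parsed.target) {
    scores.target = 1; // a required target is missing
  }
  return scores;
}
```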
Notes on specific LLM providers:
- OpenAI: When using OpenAI models, use `max_completion_tokens` instead of `max_tokens`. Leave the `min_p` field blank, as it is not supported by the OpenAI API and will cause errors.
- OpenRouter: When selecting specific providers on OpenRouter, check that provider's documentation for the model on OpenRouter; not all fields are supported by every provider. For example, when using the Fireworks AI provider through OpenRouter, the `json_schema` parameter is usable, but the `min_p` parameter must be left blank.
- LM Studio (Local): When using a locally running LM Studio instance, a model name is still required in the application. This name is used to organize and save results into the correct folders, even though LM Studio itself doesn't strictly require a model name in the API request. The API key field can be left blank when using a local LM Studio instance.
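To make the differences concrete, here is a hedged sketch of how the request bodies might differ. Model names and values are placeholders, and field availability should always be verified against the provider's documentation:

```typescript
// Illustrative request bodies only; not taken from the application's source.
const openAiBody = {
  model: "gpt-4o-mini", // placeholder model name
  messages: [{ role: "user", content: "..." }],
  max_completion_tokens: 512, // OpenAI expects this instead of max_tokens
  // min_p is intentionally omitted: the OpenAI API rejects it.
};

const fireworksViaOpenRouterBody = {
  model: "provider/model-name", // placeholder OpenRouter model slug
  messages: [{ role: "user", content: "..." }],
  max_tokens: 512,
  response_format: { type: "json_schema", json_schema: { /* schema here */ } },
  // min_p is intentionally omitted: unsupported by this provider.
};
```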
Known issues and possible improvements:
- The `scorePrompt.txt` file may need some refinement.
- The prompts file must be enclosed in single quotes rather than double quotes; this avoids problems with malformed JSON in the CHIM logs.
- The application does not currently support actions from the CHIM plugin "Min AI"; these actions are treated as unrecognized and receive an action score of 1.
- Error handling could be improved. Currently, the results of a run are lost if an error occurs. Ideally, partial results would be logged to the "Results" tab or saved to a file.
- An optional configurable wait time between requests to the LLM could be added to mitigate rate limits.
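For instance, such a delay could be as simple as the following sketch (the helper names are hypothetical, not part of the application):

```typescript
// Hypothetical helper: wait a configurable number of milliseconds.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run LLM requests one at a time, pausing between each to stay under rate limits.
async function runSequentially<T>(
  requests: Array<() => Promise<T>>,
  waitMs: number
): Promise<T[]> {
  const results: T[] = [];
  for (const send of requests) {
    results.push(await send());
    await sleep(waitMs);
  }
  return results;
}
```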
Requires Node.js.
- Clone the repository.
- Open a command prompt in the cloned folder with admin rights and type `corepack enable` to enable Yarn.
- Type `yarn` and then `yarn start` to launch the app locally.
- Or type `yarn` and then `yarn make` to build the exe and zip files used in releases.