
expts: assess performance of structured outputs #291

Merged: 4 commits into main on Jan 29, 2025

Conversation

shreyashankar
Collaborator

As per #286, I've compared structured outputs and tool calling. I get the following results:

Experiment results:

| Model | Doc % | Approach | Precision | Recall | F1 | Avg Runtime | Avg Cost ($) |
|---|---|---|---|---|---|---|---|
| azure/gpt-4o-mini | 10% | structured | 0.869 | 0.872 | 0.853 | 1.100s | $0.0004 |
| azure/gpt-4o-mini | 10% | tool | 0.914 | 0.906 | 0.891 | 0.722s | $0.0004 |
| deepseek/deepseek-chat | 10% | structured | 0.878 | 0.889 | 0.877 | 2.094s | $0.0003 |
| deepseek/deepseek-chat | 10% | tool | 0.867 | 0.856 | 0.860 | 2.212s | $0.0003 |
| lm_studio/hugging-quants/llama-3.2-3b-instruct | 10% | structured | 0.033 | 0.022 | 0.027 | 33.635s | $0.0000 |
| lm_studio/hugging-quants/llama-3.2-3b-instruct | 10% | tool | 0.000 | 0.000 | 0.000 | 70.858s | $0.0000 |

The script `structured_outputs.py` implements an experimental framework for comparing approaches to structured information extraction with LLMs. It tests two methods, structured output via a JSON schema and tool calling via a function definition, across several models (GPT-4o-mini, DeepSeek, and Llama 3.2). The experiment injects fruit and vegetable names into debate transcripts and measures how well each approach extracts them.
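For reference, here is a minimal sketch of what the two approaches look like through litellm's OpenAI-compatible interface (the `azure/`, `deepseek/`, and `lm_studio/` model prefixes suggest litellm is the client). The prompt, schema, and function name below are illustrative placeholders, not the exact definitions used in `structured_outputs.py`:

```python
# Illustrative sketch of the two extraction approaches, assuming litellm as the client.
# The schema, prompt, and names are placeholders; see structured_outputs.py for the real ones.
import json
from litellm import completion

ITEM_SCHEMA = {
    "type": "object",
    "properties": {"items": {"type": "array", "items": {"type": "string"}}},
    "required": ["items"],
    "additionalProperties": False,
}

def extract_structured(model: str, transcript: str) -> list[str]:
    """Structured output: constrain the response to a JSON schema."""
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": f"List every fruit or vegetable mentioned:\n{transcript}"}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "extracted_items", "schema": ITEM_SCHEMA, "strict": True},
        },
    )
    return json.loads(resp.choices[0].message.content)["items"]

def extract_tool(model: str, transcript: str) -> list[str]:
    """Tool calling: expose the same schema as a function definition and force its use."""
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": f"List every fruit or vegetable mentioned:\n{transcript}"}],
        tools=[{"type": "function", "function": {"name": "report_items", "parameters": ITEM_SCHEMA}}],
        tool_choice={"type": "function", "function": {"name": "report_items"}},
    )
    args = resp.choices[0].message.tool_calls[0].function.arguments
    return json.loads(args)["items"]
```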

The implementation uses parallel processing for efficiency, tracks precision, recall, F1, runtime, and cost per run, and handles API errors gracefully. Results are printed as a rich console table and saved incrementally to prevent data loss. The goal is to surface the trade-offs between structured output approaches across models, in both accuracy and operational cost.
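For context, a minimal sketch of how per-document precision/recall/F1 could be computed from the injected and extracted item sets; the actual matching logic in `structured_outputs.py` may differ (e.g., different normalization or fuzzy matching):

```python
def score(injected: set[str], extracted: set[str]) -> tuple[float, float, float]:
    """Compare injected ground-truth items against model-extracted items (exact match after normalization)."""
    injected = {item.lower().strip() for item in injected}
    extracted = {item.lower().strip() for item in extracted}
    true_positives = len(injected & extracted)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```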

Note: I don't think I'm using Llama 3.2 correctly, given how low its performance is.

@shreyashankar
Collaborator Author

I will merge this script because I'm sure we will use it in the future.

@shreyashankar merged commit b0ded0e into main on Jan 29, 2025
1 of 5 checks passed