expts: assess performance of structured outputs #291
Merged
As per #286, I have attempted to compare structured outputs and tool calling. I get the following results:
Experiment Results
╭────────────────────────────────────────────────┬───────┬────────────┬───────────┬────────┬───────┬─────────────┬──────────────╮
│ Model │ Doc % │ Approach │ Precision │ Recall │ F1 │ Avg Runtime │ Avg Cost ($) │
├────────────────────────────────────────────────┼───────┼────────────┼───────────┼────────┼───────┼─────────────┼──────────────┤
│ azure/gpt-4o-mini │ 10% │ structured │ 0.869 │ 0.872 │ 0.853 │ 1.100s │ $0.0004 │
│ azure/gpt-4o-mini │ 10% │ tool │ 0.914 │ 0.906 │ 0.891 │ 0.722s │ $0.0004 │
├────────────────────────────────────────────────┼───────┼────────────┼───────────┼────────┼───────┼─────────────┼──────────────┤
│ deepseek/deepseek-chat │ 10% │ structured │ 0.878 │ 0.889 │ 0.877 │ 2.094s │ $0.0003 │
│ deepseek/deepseek-chat │ 10% │ tool │ 0.867 │ 0.856 │ 0.860 │ 2.212s │ $0.0003 │
├────────────────────────────────────────────────┼───────┼────────────┼───────────┼────────┼───────┼─────────────┼──────────────┤
│ lm_studio/hugging-quants/llama-3.2-3b-instruct │ 10% │ structured │ 0.033 │ 0.022 │ 0.027 │ 33.635s │ $0.0000 │
│ lm_studio/hugging-quants/llama-3.2-3b-instruct │ 10% │ tool │ 0.000 │ 0.000 │ 0.000 │ 70.858s │ $0.0000 │
╰────────────────────────────────────────────────┴───────┴────────────┴───────────┴────────┴───────┴─────────────┴──────────────╯
The script structured_outputs.py implements an experimental framework for comparing approaches to structured information extraction with LLMs. It tests two methods, structured output using JSON schemas and tool calling using function definitions, across several models (GPT-4o-mini, DeepSeek, and Llama 3.2). The experiment injects fruit and vegetable names into debate transcripts and measures how well each approach extracts them.

The implementation uses parallel processing for efficiency, tracks precision, recall, F1, runtime, and cost per call, and includes error handling. Results are rendered as a rich console table and saved incrementally to prevent data loss. The goal is to surface the trade-offs between the two approaches across models, in terms of both accuracy and operational cost.
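For reference, here is a minimal sketch of what the two extraction paths and the scoring look like, assuming litellm-style OpenAI-compatible calls; the prompt, schema, and helper names (`extract_structured`, `extract_tool`, `report_items`, `score`) are illustrative and not necessarily the ones used in structured_outputs.py:

```python
# Sketch only: assumes litellm-style OpenAI-compatible completion calls.
# Prompt text, schema, and function names are illustrative.
import json

from litellm import completion

ITEM_SCHEMA = {
    "type": "object",
    "properties": {
        "items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["items"],
    "additionalProperties": False,
}

PROMPT = "List every fruit or vegetable mentioned in the transcript below.\n\n{transcript}"


def extract_structured(model: str, transcript: str) -> list[str]:
    """Structured-output path: constrain the response with a JSON schema."""
    response = completion(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(transcript=transcript)}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "extracted_items", "strict": True, "schema": ITEM_SCHEMA},
        },
    )
    return json.loads(response.choices[0].message.content)["items"]


def extract_tool(model: str, transcript: str) -> list[str]:
    """Tool-calling path: expose the same schema as a function definition."""
    response = completion(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(transcript=transcript)}],
        tools=[{
            "type": "function",
            "function": {
                "name": "report_items",
                "description": "Report the fruits and vegetables found in the transcript.",
                "parameters": ITEM_SCHEMA,
            },
        }],
        tool_choice={"type": "function", "function": {"name": "report_items"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)["items"]


def score(predicted: list[str], injected: list[str]) -> tuple[float, float, float]:
    """Set-based precision/recall/F1 against the injected ground-truth items."""
    pred = {p.lower() for p in predicted}
    gold = {g.lower() for g in injected}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Both paths constrain the model to the same `items` array, so the comparison isolates the delivery mechanism (response-format constraint vs. forced tool call) rather than the schema itself.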
Note: I don't think I'm using Llama 3.2 correctly, given how low its scores are.