LLM-as-Judge Evaluation
The LLM-as-Judge framework uses an LLM as an automated evaluator that scores the quality of question (Q) and question-answer (QA) data against the metrics defined below.
Metric Definition
Difficulty (Q)
The definition of difficulty here and the corresponding prompt are from the OpenThoughts paper.
- Q_Code_Difficulty: Evaluate the complexity of code-related questions, considering algorithmic challenges, code structure, and implementation difficulty.
- Q_Math_Difficulty: Evaluate the complexity of math-related questions, considering mathematical concepts, solution steps, and reasoning difficulty.
Relevance (QA)
- QA_Relevance: Evaluate whether the answer remains focused on the question, avoiding digressions or irrelevant information.
Clarity (Q & QA)
- Q_Clarity: Evaluate whether the question is clearly expressed, free from ambiguity, and easy to comprehend.
- QA_Clarity: Evaluate whether the answer is articulated with clarity, linguistic fluency, and logical structure.
Coherence (Q & QA)
- Q_Coherence: Evaluate the internal logic of the question, ensuring it is coherent and devoid of contradictions.
- QA_Coherence: Evaluate the relevance of the answer to the question, ensuring logical consistency in arguments and evidence.
Completeness (Q & QA)
- Q_Completeness: Evaluate whether the question provides sufficient information for a model to generate a complete answer.
- QA_Completeness: Evaluate whether the answer fully addresses the user's question, covering all essential aspects.
Complexity (Q & QA)
- Q_Complexity: Evaluate the inherent difficulty of the question, considering factors like multiple concepts, multi-step reasoning, or specialized domain knowledge.
- QA_Complexity: Evaluate the depth of analysis and reasoning in the answer, as well as the complexity of the problem addressed.
Correctness (Q & QA)
- Q_Correctness: Evaluate the accuracy of the facts or premises presented in the question.
- QA_Correctness: Evaluate the accuracy of the information, facts, and logical reasoning in the answer.
Meaningfulness (Q & QA)
- Q_Meaningfulness: Evaluate whether the question is meaningful, offering practical value or thought-provoking content.
- QA_Meaningfulness: Evaluate whether the answer provides valuable, in-depth insights that encourage further reflection.
All
- Q/QA_All: Score all of the metrics above in a single turn, without outputting the reasoning process.
Framework Architecture
- main.py: Initiates the evaluation process, loads the configuration, and sets up the evaluator.
- config.yaml: A configuration file that defines API keys, model parameters, input/output paths, and evaluation metrics.
- config.py: Uses Pydantic to load and validate the configuration file.
- evaluator.py: The core evaluator, which processes data asynchronously and calls the LLM API to obtain scores.
- prompts/: Stores the prompt templates required for evaluation, allowing flexible prompt management.
- validators.py: Validates the responses from the LLM to ensure their format and content are correct.
- tools/process_scores.py: A post-processing tool that merges score results back into the original data file.
- output/: A directory that stores evaluation results, error logs, and processed IDs.
YAML Configuration
All configurations for the framework are centralized within a single config.yaml file, ensuring streamlined management and easy adjustments. Below is a detailed example configuration:
# API and Model Configuration
openai:
  api_key: "your_api_key" # or use "env:OPENAI_API_KEY" to read from environment variables
  base_url: "your_base_url" # or your custom base url
model: "gpt-4.1-nano"
concurrency: 1024 # Number of concurrent requests to the API, 128 ~ 1024 is recommended
timeout: 30 # Timeout for each API request in seconds
retry: 3 # Number of retries for failed API calls
chunk_size: 2048 # Number of items to read from the input file at once
temperature: 0.1 # The sampling temperature, between 0 and 2
top_p: 1.0 # The nucleus sampling probability
# Directory Configuration
input_path: "../data_process/example_input_add_key.jsonl"
output_path: "output"
prompts_dir: "prompts"
id_track_file: "output/scored_ids.txt" # Add this line to enable ID tracking
# Metrics Configuration
# Define which metrics to run for each mode (Q or QA)
# The names must correspond to the prompt files in `prompts_dir`
# e.g., 'Correctness' for QA mode requires a 'QA_Correctness.txt' file.
metrics:
  Q: # Metrics for Question-only evaluation
    - "All"
    # - "Clarity"
    # - "Coherence"
    # - "Completeness"
    # - "Complexity"
    # - "Correctness"
    # - "Meaningness"
  QA: # Metrics for Question-Answer evaluation
    - "All"
    # - "Clarity"
    # - "Coherence"
    # - "Completeness"
    # - "Complexity"
    # - "Correctness"
    # - "Meaningness"
    # - "Relevance"
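The `env:` prefix mentioned in the `api_key` comment could be resolved along the following lines. This is a minimal sketch, not the framework's actual code: the helper name `resolve_secret` is hypothetical, and raising an error when the variable is missing is an assumption.

```python
import os

ENV_PREFIX = "env:"


def resolve_secret(value: str) -> str:
    """Return `value` as-is, or look it up in the environment if it starts with 'env:'."""
    if value.startswith(ENV_PREFIX):
        var_name = value[len(ENV_PREFIX):]
        try:
            return os.environ[var_name]
        except KeyError:
            # Assumed behavior: fail early rather than send an empty key to the API.
            raise ValueError(f"Environment variable {var_name!r} is not set") from None
    return value
```

With this approach, `api_key: "env:OPENAI_API_KEY"` keeps the secret out of the configuration file, while a literal key still works unchanged.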
Scoring Process
- Configuration Loading: The process begins by loading the configuration from config.yaml, which includes API credentials, model settings, and evaluation metrics.
- Prompt Loading: The evaluator loads the necessary prompt templates from the prompts/ directory based on the metrics specified in the configuration.
- Data Reading: Input data is read from the JSONL file specified by input_path, chunk_size items at a time, to avoid blocking during data reading and excessive memory usage. Each line represents a separate evaluation item, and the corresponding fields of the item are used to fill {instruction} and {output} in the prompt.
- Asynchronous Evaluation: The evaluator processes each item asynchronously, sending requests to the LLM API using the loaded prompts and model settings.
- Retry Mechanism: If an API request fails due to rate limits or other transient errors, the evaluator automatically retries the request up to a specified number of times (retry), with exponential backoff between attempts. To avoid repeating the same error, the temperature is also increased exponentially on each retry, capped at a maximum of 1.
- Response Validation: The responses from the LLM are validated to ensure they meet the expected format. This includes checking that the output is a valid JSON object, contains all required keys, and that all score values are integers within the range of 1 to 10.
- Result Writing: Successfully evaluated items are written to _scored.jsonl files, while any errors encountered are logged in _errors.jsonl files. Writing is semi-streaming: after a chunk of chunk_size items has been processed, its results are written to the file together.
- ID Tracking: If enabled, the IDs of evaluated items are tracked to prevent re-evaluation of the same items in future runs.
- Post-Processing: The scores can be merged back into the original data files using the process_scores.py tool for further analysis or reporting.
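The asynchronous evaluation and retry steps can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in evaluator.py: the names `temperature_for_attempt`, `call_with_retry`, and `evaluate_all` are hypothetical, `send` and `score_item` stand in for the real LLM API call, and the doubling factor for both the backoff delay and the temperature is a guess, since the description only says "exponentially".

```python
import asyncio


def temperature_for_attempt(base: float, attempt: int, cap: float = 1.0) -> float:
    """Temperature for the given attempt (0 = first try), capped at `cap`."""
    # Doubling per attempt is an assumption; the text only says "exponentially".
    return min(base * (2 ** attempt), cap)


async def call_with_retry(send, *, retry: int = 3, temperature: float = 0.1,
                          base_delay: float = 1.0):
    """Await `send(temperature)`, retrying up to `retry` times on failure."""
    for attempt in range(retry + 1):
        try:
            return await send(temperature_for_attempt(temperature, attempt))
        except Exception:  # a real client would catch only transient errors
            if attempt == retry:
                raise  # retries exhausted: surface the last error
            # Exponential backoff: base_delay, then 2x, 4x, ...
            await asyncio.sleep(base_delay * (2 ** attempt))


async def evaluate_all(items, score_item, concurrency: int = 128):
    """Score all items concurrently, with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(item):
        async with semaphore:
            return await score_item(item)

    # With return_exceptions=True, failures come back as exception objects
    # in the result list instead of aborting the whole batch.
    return await asyncio.gather(*(worker(i) for i in items), return_exceptions=True)
```

Under these assumptions, a request configured with `temperature: 0.1` and `retry: 3` would be attempted at temperatures 0.1, 0.2, 0.4, and 0.8 before the item is recorded as an error.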
Usage
To effectively utilize the LLM-as-Judge framework, please follow the detailed steps outlined below:
1. Configure config.yaml
Begin by setting up your config.yaml file. This involves specifying your API keys, selecting your preferred model, and configuring other necessary parameters as detailed in the YAML configuration section above. This step is crucial for ensuring that the framework operates with the correct settings.
- model should be a model that supports Structured Outputs, such as gpt-4.1-nano or gpt-4o-mini, so that responses are returned in JSON format.
- For the Q_Code_Difficulty and Q_Math_Difficulty metrics, we recommend setting temperature=1.0, following the original OpenThoughts paper settings. For other metrics, temperature=0.1 is recommended.
2. Prepare Data
- Configure input_path in config.yaml: this must be the path to a single .jsonl file to be scored.
- Each line in your .jsonl file should be a JSON object containing the fields id, instruction, and output.
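Before launching a run, it can help to check that the input has the expected shape. The sketch below reads JSONL lines in chunks, verifies the three required fields, and skips IDs that were already scored. The function name `iter_chunks` and the `seen_ids` parameter are hypothetical and not part of the framework.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Set

REQUIRED_FIELDS = ("id", "instruction", "output")


def iter_chunks(lines: Iterable[str], chunk_size: int,
                seen_ids: Optional[Set] = None) -> Iterator[List[dict]]:
    """Yield lists of up to `chunk_size` parsed items, skipping blank lines and seen IDs."""
    seen = seen_ids or set()
    it = iter(lines)
    while True:
        raw = list(islice(it, chunk_size))  # read at most chunk_size lines at once
        if not raw:
            return  # input exhausted
        chunk = []
        for line in raw:
            if not line.strip():
                continue
            item = json.loads(line)
            missing = [f for f in REQUIRED_FIELDS if f not in item]
            if missing:
                raise ValueError(f"Item is missing required fields: {missing}")
            if item["id"] not in seen:
                chunk.append(item)
        if chunk:
            yield chunk
```

Passing an open file object as `lines` keeps memory usage bounded by `chunk_size`, since the file is consumed lazily.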
3. Prepare Prompts
- Navigate to the prompts/ directory and create a prompt file for each evaluation metric as defined in your config.yaml.
- The naming convention for these files should be Mode_MetricName.txt (e.g., QA_Correctness.txt or Q_Clarity.txt).
- Within these prompt files, use {instruction} and {output} as placeholders for the QA mode.
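The naming convention and placeholder substitution can be illustrated with a short sketch. The helper names `prompt_path` and `render_prompt` are hypothetical, and how the framework itself fills the placeholders is not stated here. A regular expression is used instead of `str.format` as a design choice, because prompts that request JSON output usually contain literal braces that `str.format` would misread as fields.

```python
import re
from pathlib import Path

# Only these two placeholders are substituted; all other braces are left alone.
PLACEHOLDER = re.compile(r"\{(instruction|output)\}")


def prompt_path(prompts_dir: str, mode: str, metric: str) -> Path:
    """Build the expected template path, e.g. prompts/QA_Correctness.txt."""
    return Path(prompts_dir) / f"{mode}_{metric}.txt"


def render_prompt(template: str, item: dict) -> str:
    """Fill {instruction} and {output} from `item` in a single pass."""
    return PLACEHOLDER.sub(lambda m: str(item.get(m.group(1), "")), template)
```

A template such as `Question: {instruction}` followed by a JSON output specification is rendered without disturbing the JSON braces.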
4. Validator Configuration (optional)
- In our project, to improve efficiency and reduce cost, we score multiple metrics in a single round of dialogue (see the prompt for the All metric), and we do not require the model to output its reasoning or scoring process.
- For more fine-grained scoring, we recommend having the model output its reasoning before determining the score (as shown in the prompt implementations for the other metrics). To use this scoring method, modify the llm_as_judge/validators.py file to remove the check that requires the values returned by the LLM in the JSON to be integers between 0 and 10, so that reasoning can be included in the output.
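For orientation, the kind of check described above might look like the sketch below. It is not the contents of the validator file: the function name `validate_scores` is hypothetical, and the score bounds are left as parameters with assumed defaults of 1 and 10.

```python
import json
from typing import Iterable


def validate_scores(raw: str, required_keys: Iterable[str],
                    low: int = 1, high: int = 10) -> dict:
    """Parse an LLM response and check that every required key holds an in-range integer."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("Response must be a JSON object")
    for key in required_keys:
        if key not in data:
            raise ValueError(f"Missing required key: {key}")
        value = data[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"Score for {key} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"Score for {key} must be between {low} and {high}")
    return data
```

To allow reasoning text in the response, a check of this kind would be relaxed so that only the score fields, not every value, must be integers.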
5. Run Evaluation
To execute the evaluation process, run the main script from your terminal using the following command:
python -m llm_as_judge.main
Alternatively, if you want to use a different configuration file:
python -m llm_as_judge.main --config-path /path/to/your/config.yaml
6. Post-Process
- To align the output with other tools and maintain flexibility, you can perform an optional post-processing step. The main evaluation script outputs a scores.jsonl file containing only the id and scores. You can use process_scores.py to merge these scores back into your original data file, which may contain other fields.
python tools/process_scores.py --scores_file [PATH_TO_SCORES_FILE] --data_file [PATH_TO_DATA_FILE] --output_file [PATH_TO_OUTPUT_FILE]
The result is an output file in which the corresponding null placeholders are replaced with the scores.
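Conceptually, the merge is a join on id. The sketch below illustrates the idea only and is not the implementation of process_scores.py: the function name `merge_scores` is hypothetical, and storing the result under a single `scores` field that initially holds null is an assumption about the data layout.

```python
import json
from typing import Iterable, List


def merge_scores(data_lines: Iterable[str], score_lines: Iterable[str],
                 field: str = "scores") -> List[dict]:
    """Attach each record's scores to the original item with the same id."""
    scores_by_id = {}
    for line in score_lines:
        if line.strip():
            record = json.loads(line)
            scores_by_id[record["id"]] = record[field]

    merged = []
    for line in data_lines:
        if not line.strip():
            continue
        item = json.loads(line)
        # Items without a score keep their existing value (e.g. a null placeholder).
        item[field] = scores_by_id.get(item["id"], item.get(field))
        merged.append(item)
    return merged
```

All other fields of the original items pass through unchanged, so the merged file can be used wherever the original data file was.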
Citation
@article{guha2025openthoughts,
author = {Guha, Etash and Marten, Ryan and Keh, Sedrick and Raoof, Negin and Smyrnis, Georgios and Bansal, Hritik and Nezhurina, Marianna and Mercat, Jean and Vu, Trung and Sprague, Zayne and others},
journal = {arXiv preprint arXiv:2506.04178},
title = {OpenThoughts: Data Recipes for Reasoning Models},
year = {2025},
}
@inproceedings{liu-etal-2024-alignbench,
address = {Bangkok, Thailand},
author = {Liu, Xiao and Lei, Xuanyu and Wang, Shengyuan and Huang, Yue and Feng, Andrew and Wen, Bosi and Cheng, Jiale and Ke, Pei and Xu, Yifan and Tam, Weng Lam and Zhang, Xiaohan and Sun, Lichao and Gu, Xiaotao and Wang, Hongning and Zhang, Jing and Huang, Minlie and Dong, Yuxiao and Tang, Jie},
booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
doi = {10.18653/v1/2024.acl-long.624},
month = aug,
pages = {11621--11640},
publisher = {Association for Computational Linguistics},
title = {{A}lign{B}ench: Benchmarking {C}hinese Alignment of Large Language Models},
url = {https://aclanthology.org/2024.acl-long.624/},
year = {2024},
}