Releases: open-compass/opencompass
OpenCompass v0.3.1
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.1!
🌟 Highlights
- 🚀 Added pip installation support and updated the README and evaluation demo.
- 🐛 Fixed various dataset loading issues.
- ⚙️ Enhanced auto-download features for datasets.
🚀 New Features
- 🆕 Introduced support for Ruler datasets.
- 🆕 Enhanced model compatibility.
- 🆕 Improved dataset handling, with auto-download support for various datasets.
📖 Documentation
- 📚 Updated README to reflect the latest changes.
- 📚 Improved documentation for dataset loading procedures.
🐛 Bug Fixes
- 🐞 Resolved modelscope dataset load issues.
- 🐞 Corrected evaluation scores for the Lawbench dataset.
- 🐞 Fixed dataset bugs for CommonsenseQA and Longbench.
⚙ Enhancements and Refactors
- 🔧 Retained first and last halves of prompts to avoid max_seq_len issues.
- 🔧 Updated Compassbench to v1.3.
- 🔧 Switched to Python runner for single GPU operations.
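The prompt-retention change listed above (keeping the first and last halves of an over-long prompt) can be illustrated with a minimal sketch. The function name `truncate_middle` and the token-list interface are assumptions made for illustration; the actual implementation in PR #1373 may differ.

```python
from typing import List


def truncate_middle(token_ids: List[int], max_len: int) -> List[int]:
    """Keep the head and tail of a token sequence, dropping the middle.

    Hypothetical helper: not the OpenCompass implementation.
    """
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    if len(token_ids) <= max_len:
        # Short enough already: return an unchanged copy.
        return list(token_ids)
    head = max_len // 2   # number of tokens kept from the start
    tail = max_len - head  # number of tokens kept from the end
    # Use an explicit start index; token_ids[-0:] would return everything.
    return token_ids[:head] + token_ids[len(token_ids) - tail:]


print(truncate_middle(list(range(10)), 4))  # [0, 1, 8, 9]
```

Keeping both ends preserves the instruction at the start of a prompt and the question at its end, which is usually where the most task-relevant text sits.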
🎉 Welcome New Contributors
- 🙌 @Yunnglin for fixing modelscope dataset load problem.
- 🙌 @changyeyu for addressing max_seq_len issues with prompt handling.
- 🙌 @seetimee for updates to openai_api.py.
- 🙌 @HariSeldon0 for adding the scicode dataset.
What's Changed
- [Fix] Fix modelscope dataset load problem by @Yunnglin in #1406
- [Fix] the issue where scores are negative in the Lawbench dataset evaluation(#1402) by @yaoyingyy in #1403
- [Doc] Update README by @tonysy in #1404
- Retain first and last halves of prompts to avoid max_seq_len issues by @changyeyu in #1373
- [UPDATE] Compassbench v1.3 by @MaiziXiao in #1396
- [Fix] longbench dataset load fix by @MaiziXiao in #1422
- [Fix] Sub summarizer order fix by @bittersweet1999 in #1426
- [Update] Support auto-download of FOFO/MT-Bench-101 by @tonysy in #1423
- [Bug] Commonsenseqa dataset fix by @MaiziXiao in #1425
- [Feature] Add abbr for rolebench dataset by @xu-song in #1431
- [Feature] Add Ruler datasets by @MaiziXiao in #1310
- [Fix] Fix openai api tiktoken bug for api server by @liushz in #1433
- Update openai_api.py by @seetimee in #1438
- [Feature] Add model support for 'huggingface_above_v4_33' when using '-a' by @liushz in #1430
- Add scicode by @HariSeldon0 in #1417
- [Doc] Update Readme by @MaiziXiao in #1439
- [Fix] Update option postprocess & mathbench language summarizer by @liushz in #1413
- [ci] add command testcase into daily testcase by @zhulinJulia24 in #1447
- [Feature] Switch to python runner for single GPU by @xu-song in #1308
- [Fix] Update SciCode and Gemma model by @tonysy in #1449
- [Bump] Bump version to 0.3.1 by @tonysy in #1450
Full Changelog: 0.3.0...0.3.1
Thank you for your continued support and contributions to OpenCompass!
OpenCompass v0.3.0
The OpenCompass team is thrilled to announce the release of OpenCompass v0.3.0! This release brings a variety of new features, enhancements, and bug fixes to improve your experience.
🌟 Highlights
- Support for OpenAI ChatCompletion
- Updated Model Support List
- Support Dataset Automatic Download
- Support pip install opencompass
🚀 New Features
- Support for CompassBench Checklist Evaluation
- PR #1339 by @bittersweet1999
- Adding support for Doubao API
- PR #1218 by @LeavittLang
- Support for ModelScope Datasets
- PR #1289 by @wangxingjun778
📖 Documentation
🐛 Bug Fixes
- Fix Typing and Typo
- Fix Lint Issues
- PR #1334 by @DseidLi
- Fix Summary Error in subjective.py
⚙ Enhancements and Refactors
- Upgrade Default Math pred_postprocessor
- Fix Path and Folder Updates
- Update Get Data Path for LCBench and HumanEval
🔗 Full Change Logs
- [Fix] Change abbr for arenahard dataset by @bittersweet1999 in #1302
- [Fix] Force register by @Leymore in #1311
- [Fix] add bc for alignbench summarizer by @bittersweet1999 in #1306
- [Fix] update Faq by @bittersweet1999 in #1313
- [Fix] Fix rouge evaluator of rolebench_zh by @xu-song in #1322
- [Doc] Update NeedleBench Docs by @DseidLi in #1330
- [Fix] Fix typing and typo by @xu-song in #1331
- [Fix] Fix lint by @DseidLi in #1334
- [Feature] support compassbench Checklist evaluation by @bittersweet1999 in #1339
- Add compassbench wiki&math part by @liushz in #1342
- Compassbench v1_3 subjective evaluation by @MaiziXiao in #1341
- [Fix] Update path and folder by @tonysy in #1344
- Upgrade default math pred_postprocessor by @xu-song in #1340
- commit inference ppl datasets by @Quehry in #1315
- CompassBench subjective summarizer added by @MaiziXiao in #1349
- Fix MathBench Generation Config by @liushz in #1351
- [Update] Update model support list by @bittersweet1999 in #1353
- [Update] update Subeval demo config by @bittersweet1999 in #1358
- [Fix] Fix the summary error in subjective.py by @WenjinW in #1363
- [Fix] Support HF models deployed with an OpenAI-compatible API. by @heya5 in #1352
- update docs by @Leymore in #1318
- [Feature] Make NeedleBench available on HF by @DseidLi in #1364
- [Bug fix] Remove extra ampersands. by @baymax591 in #1365
- [Fix] minor update wildbench by @kleinzcy in #1335
- Adding support for Doubao API by @LeavittLang in #1218
- [Fix] origin_prompt should be None in llm-compression task by @mqy004 in #1225
- Calm dataset by @pengbo807 in #1287
- Add en and zh groups to longbench summarizer; Fix longbench overall score by @xu-song in #1216
- [Revert] "Calm dataset (#1287)" by @bittersweet1999 in #1366
- Charm by @jxd0712 in #1230
- Support ModelScope datasets by @wangxingjun778 in #1289
- [Feature] Update pip install by @tonysy in #1324
- add support for hf_pulse_7b by @QXY716 in #1255
- [Fix] Update get_data_path for LCBench and HumanEval by @tonysy in #1375
- [Bug] Fix bug in turbomind by @tonysy in #1377
- [Fix] Fix version mismatch of CIBench by @kleinzcy in #1380
- [Fix] Fix InternLM2.5-7B-Chat-1M config by @DseidLi in #1383
- [Feature] Support import configs/models/summarizers from whl by @tonysy in #1376
- Calm dataset by @pengbo807 in #1385
- [Feature] Support OpenAI ChatCompletion by @tonysy in #1389
- [Fix] Fix slurm env by @tonysy in #1392
- [Fix] Fix CaLM import by @tonysy in #1395
- [Bump] Bump version for v0.3.0 by @tonysy in #1398
🎉 Welcome New Contributors
- @MaiziXiao made their first contribution in #1341
- @Quehry made their first contribution in #1315
- @WenjinW made their first contribution in #1363
- @heya5 made their first contribution in #1352
- @LeavittLang made their first contribution in #1218
- @pengbo807 made their first contribution in #1287
- @wangxingjun778 made their first contribution in #1289
- @QXY716 made their first contribution in #1255
Full Changelog: 0.2.6...0.3.0
OpenCompass v0.2.6
The OpenCompass team is thrilled to announce the release of OpenCompass v0.2.6!
🌟 Highlights
- No noteworthy highlights.
🚀 New Features
📖 Documentation
🐛 Bug Fixes
- #1221 Resolve release version installation and import issues
- #1228 Fix pip version issues
- #1282 Update MathBench summarizer & fix cot setting
⚙ Enhancements and Refactors
- #1284 Reorganize subjective eval
🎉 Welcome New Contributors
- @mqy004, @sefira, @Zor-X-L and @baymax591 made their first contributions. Welcome to the OpenCompass community!
🔗 Full Change Logs
- [Fix] fix summarizer by @bittersweet1999 in #1217
- Fix the inability to import opencompass.cli.main after installing the release version by @mqy004 in #1221
- MT-Bench-101 by @sefira in #1215
- [Feature] add dataset Fofo by @bittersweet1999 in #1224
- [Fix] fix pip version by @bittersweet1999 in #1228
- add ",<2.0.0" to "numpy>=1.23.4" in requirements/runtime.txt, as pand… by @Zor-X-L in #1267
- Support wildbench by @kleinzcy in #1266
- Add doc for accelerator function by @liushz in #1252
- flash attn installation in daily testcase by @zhulinJulia24 in #1272
- Update mtbench101.py by @sefira in #1276
- [Sync] Sync with internal codes 2024.06.28 by @Leymore in #1279
- Update MathBench summarizer & fix cot setting by @liushz in #1282
- NPU adaptation by @baymax591 in #1250
- [ci] update daily testcase by @zhulinJulia24 in #1285
- [Feature] Add InternLM2.5 by @tonysy in #1286
- [Feat] Update owners for issues by @tonysy in #1293
- [Refactor] Reorganize subjective eval by @bittersweet1999 in #1284
- [Doc] quick start swap tabs by @Leymore in #1263
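The numpy constraint introduced in #1267 (adding ",<2.0.0" to "numpy>=1.23.4") bounds the accepted versions on both sides. A minimal sketch of what that range admits is shown here; `parse_release` and `satisfies_numpy_pin` are hypothetical helpers that only handle plain numeric versions such as "1.26.4", and they are not how pip itself resolves specifiers.

```python
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Turn 'X.Y.Z' into a tuple of ints, padding to three components."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))  # '2.0' is treated as '2.0.0'
    return tuple(parts)


def satisfies_numpy_pin(version: str) -> bool:
    """Check a version against the range >=1.23.4,<2.0.0 (illustrative only)."""
    return (1, 23, 4) <= parse_release(version) < (2, 0, 0)


for candidate in ["1.23.3", "1.23.4", "1.26.4", "2.0.0"]:
    print(candidate, satisfies_numpy_pin(candidate))
```

The upper bound excludes the numpy 2.x series while still allowing any 1.x release from 1.23.4 onward.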
Full Changelog: 0.2.5...0.2.6
OpenCompass v0.2.5
The OpenCompass team is thrilled to announce the release of OpenCompass v0.2.5!
🌟 Highlights
- Simplify the huggingface / vllm / lmdeploy model wrapper. meta_template no longer needs to be hand-crafted in model configs.
- Introduce evaluation results README in ~20 dataset config folders.
🚀 New Features
- #1065 Add LLaMA-3 Series Configs
- #1048 Add TheoremQA with 5-shot
- #1094 Support Math evaluation via judgemodel
- #1080 Add gpqa prompt from simple_evals, openai
- #1074 Add mmlu prompt from simple_evals, openai
- #1123 Add Qwen1.5 MoE 7b and Mixtral 8x22b model configs
📖 Documentation
- #1053 Update readme
- #1102 Update NeedleInAHaystack Docs
- #1110 Update README.md
- #1205 Remove --no-batch-padding and Use --hf-num-gpus
🐛 Bug Fixes
- #1036 Update setup.py install_requires
- #1051 Fixed the issue caused by the repeated loading of VLLM model dur…
- #1043 fix multiround
- #1070 Fix sequential runner
- #1079 Fix Llama-3 meta template
⚙ Enhancements and Refactors
- #1163 enable HuggingFacewithChatTemplate with --accelerator via cli
- #1104 fix prompt template
- #1109 Update performance of common benchmarks
🎉 Welcome New Contributors
- @liuwei130, @IcyFeather233, @VVVenus1212, @binary-husky, @dmitrysarov, @eltociear, @acylam, @lfy79001, @JuhaoLiang1997, @yaoyingyy, and @jxd0712 made their first contributions. Welcome to the OpenCompass community!
🔗 Full Change Logs
- [Fix] Update setup.py install_requires by @Leymore in #1036
- add ChemBench by @liuwei130 in #1032
- [Fix] logger.error -> logger.debug in OpenAI by @Leymore in #1050
- [Sync] Bump version to 0.2.4 by @Leymore in #1052
- [Doc] Update readme by @tonysy in #1053
- [fix]Fixed the issue caused by the repeated loading of VLLM model dur… by @IcyFeather233 in #1051
- [Sync] Sync with internal code 2024.04.19 by @Leymore in #1064
- [Fix] fix multiround by @bittersweet1999 in #1043
- [Feature] Add LLaMA-3 Series Configs by @Leymore in #1065
- [Feature] Add TheoremQA with 5-shot by @Leymore in #1048
- [Fix] Fix sequential runner by @Leymore in #1070
- Add lmdeploy tis python backend model by @ispobock in #1014
- Fix Llama-3 meta template by @liushz in #1079
- Add humaneval prompt from simple_evals, openai by @jingmingzhuo in #1076
- [Feature] Support Math evaluation via judgemodel by @bittersweet1999 in #1094
- [Feature] support arenahard evaluation by @bittersweet1999 in #1096
- Update CIBench by @kleinzcy in #1089
- [Feature] Add gpqa prompt from simple_evals, openai by @Francis-llgg in #1080
- [Deprecate] Remove multi-modal related stuff by @kennymckormick in #1072
- add vllm get_ppl by @VVVenus1212 in #1003
- fix: python path bug by @binary-husky in #1063
- fix output typing, change mutable list to immutable tuple by @dmitrysarov in #989
- [Doc] Update NeedleInAHaystack Docs by @DseidLi in #1102
- [Feature] add support for Flames datasets by @Yggdrasill7D6 in #1093
- adapt to lmdeploy v0.4.0 by @lvhan028 in #1073
- [Fix] fix prompt template by @bittersweet1999 in #1104
- [Fix] Fix Math Evaluation with Judge Model Evaluator & Add README by @liushz in #1103
- [Update] Update performance of common benchmarks by @tonysy in #1109
- [Fix] fix cmb dataset by @bittersweet1999 in #1106
- [Docs] Update README.md by @eltociear in #1110
- [Feature] Adding support for LLM Compression Evaluation by @acylam in #1108
- [Fix] remove redundant pre-commit check by @Leymore in #891
- fix LightllmApi workers bug by @helloyongyang in #1113
- [Feature] Add mmlu prompt from simple_evals, openai by @Leymore in #1074
- [Feature] update drop dataset from openai simple eval by @kleinzcy in #1092
- add mgsm datasets by @Yggdrasill7D6 in #1081
- [Fix] Fix AGIEval chinese sets by @xu-song in #972
- S3Eval Dataset by @lfy79001 in #916
- [Feature] Add AceGPT-MMLUArabic benchmark by @JuhaoLiang1997 in #1099
- [Fix] fix links by @bittersweet1999 in #1120
- [Fix] Fix NeedleBench Summarizer Typo by @DseidLi in #1125
- [Feature] Add Qwen1.5 MoE 7b and Mixtral 8x22b model configs by @acylam in #1123
- [Sync] Update accelerator by @Leymore in #1122
- [Fix] fix alpacaeval while add caching path by @bittersweet1999 in #1139
- [Fix] fix multiround by @bittersweet1999 in #1146
- [Fix] Fix Needlebench Summarizer by @DseidLi in #1143
- [Feature] Add huggingface apply_chat_template by @Leymore in #1098
- [Feat] Support dataset_suffix check for mixed configs by @xu-song in #973
- [Format] Add some config lints by @Leymore in #892
- [Sync] Sync with internal codes 2024.05.14 by @Leymore in #1156
- [Fix] fix arenahard summarizer by @bittersweet1999 in #1154
- [Fix] use ProcessPoolExecutor during mbpp eval by @Leymore in #1159
- [Fix] Update stop_words in huggingface_above_v4_33 by @Leymore in #1160
- Update accelerator by @liushz in #1152
- [Feat] enable HuggingFacewithChatTemplate with --accelerator via cli by @Leymore in #1163
- update test workflow by @zhulinJulia24 in #1167
- [Sync] Sync with internal codes 2024.05.17 by @Leymore in #1171
- add dependency in daily test workflow by @zhulinJulia24 in #1173
- [Sync] Sync with internal codes 2024.05.21.1 by @Leymore in #1175
- Update MathBench by @liushz in #1176
- [Fix] fix template by @bittersweet1999 in #1178
- Fix a bug in drop_gen.py by @kleinzcy in #1191
- [Fix] temporary files using tempfile by @yaoyingyy in #1186
- [Fix] add support for lmdeploy api judge by @bittersweet1999 in #1193
- [Fix] fix length by @bittersweet1999 in #1180
- support CHARM (https://github.com/opendatalab/CHARM) reasoning tasks by @jxd0712 in #1190
- [Feat] Update charm summary by @Leymore in #1194
- Update accelerator by @liushz in #1195
- [Sync] S...
OpenCompass v0.2.5.rc1
[Feature] Add lmdeploy tis python backend model (#1014)
- add lmdeploy tis python backend model
- fix pr check
- update
OpenCompass v0.2.4
The OpenCompass team is thrilled to announce the release of OpenCompass v0.2.4!
🌟 Highlights
- Enhanced support for multiple datasets including QuALITY, APPS and TACO.
- Introducing multi-model judging for subjective test.
- Bug fixes and improvements in configurations and documentation.
🚀 New Features
🌐 General
- Feat #963 - Support for APPS dataset.
- Feature #976 - Add the implementation of QuALITY datasets.
- Feature #984 - Add support for setting prediction paths.
- Feature #1006 - Support alpacaeval_v2.
- Feature #1016 - Add multi-model judge.
- Feature #1019 - Add ATC Choice Version.
📖 Documentation
- Updates docs #1015 - General documentation updates and improvements.
🐛 Bug Fixes
- Fix #964 - Fix the config's name of deepseek-coder.
- Fix #890 - Update links and link checkers.
- Fix #977 - Fix a bug in internlm2 series configs.
- Fix #975 - Fix documentation issues.
- Fix #992 - Fix running issues in turbomind_tis.
- Fix #994 - Change status to list in base.py.
- Fix #995, Fix #1020 - Quick fixes and refactors for configs.
⚙ Enhancements and Refactors
- Modify requirements/runtime.txt #983 - Update numpy version requirement.
- Update Needlebench and configs #986 - Enhancements in Needlebench configurations.
- Simplify needlebench summarizer #1024 - Streamline Needlebench summarizer for better efficiency.
🎉 Welcome New Contributors
- @seanzhang-zhichen, @kleinzcy, @ispobock, @Chaseldot, and @Y0oMu made their first contributions. Welcome to the OpenCompass community!
🔗 Full Change Logs
- [Fix] fix the config's name of deepseek-coder by @jingmingzhuo in #964
- [Fix] Update links and link checkers by @Leymore in #890
- [Feat] support apps by @Connor-Shen in #963
- fix doc problem by @seanzhang-zhichen in #975
- [Fix] fix a bug in internlm2 series configs by @jingmingzhuo in #977
- [Feature] Add the implement of QuALITY datasets by @jingmingzhuo in #976
- modify the requirements/runtime.txt: numpy==1.23.4 --> numpy>=1.23.4 by @kleinzcy in #983
- [Feature] add support for set prediction path by @bittersweet1999 in #984
- [Feat] Support TACO by @Connor-Shen in #966
- [Feature] update apps by @Connor-Shen in #985
- [Fix] update apps/taco by @Connor-Shen in #988
- [Feature] add one script for subjective by @bittersweet1999 in #993
- Fix running issues in turbomind_tis by @ispobock in #992
- [Fix] base.py change status into list by @Chaseldot in #994
- [Fix] quick fix for configs by @bittersweet1999 in #995
- [Feature] update needlebench and configs by @DseidLi in #986
- [Feature] support alpacaeval_v2 by @bittersweet1999 in #1006
- updates docs by @Y0oMu in #1015
- [Feature] Add multi-model judge and fix some problems by @bittersweet1999 in #1016
- [Fix] Refactor Needlebench Configs for CLI Testing Support by @DseidLi in #1020
- [Feature] Add ATC Choice Version by @DseidLi in #1019
- [Fix] Simplify needlebench summarizer by @DseidLi in #1024
For a detailed overview of all changes, check out our Full Changelog.
OpenCompass v0.2.4.rc1
This release provides more parsed datasets:
OpenCompassData-complete-20240325.zip
Important updates compared to the previous version are as follows:
- Subjective: Add MTBench
- LongText: Support Needle-In-Haystack Test Dataset
- Code: Update generation version of CIBench
OpenCompass v0.2.3
The OpenCompass team is thrilled to announce the release of OpenCompass v0.2.3! This version is packed with new features, crucial fixes, and documentation updates to improve your experience. We're continuously working to enhance OpenCompass, making it more robust and versatile for all users.
🌟 Highlights:
- Enhanced Model Support: Introduction of new models and configurations, including support for the LightllmApi, lmdeploy pytorch engine, and more.
- New Datasets and Benchmarks: Expanding our dataset repository with additions like OpenFinData, lveval benchmark, and an upgrade to Needlebench.
- Documentation and Sync Improvements: Updated dataset pack URLs, fixed documentation errors, and synchronized with internal codes for consistency.
Explore the key updates in this release:
🌟 New Features:
- 📦 Dataset and Benchmark Expansion:
- 🛠 Model and API Integrations:
- Enhanced functionality with support for LightllmApi input_format and prompt templates, alongside the introduction of get_ppl for TurbomindModel (#888, #878).
- New model configurations added, including support for gemini and deepseek-coder, further broadening the tools available for users (#931, #943).
- 📖 Documentation and Sync Updates:
🐛 Bug Fixes:
- Addressed various configuration and template issues to ensure smoother operation across different models and benchmarks (#894, #893).
- Fixed issues related to IFEval, including type hints and config bugs, enhancing evaluation accuracy and functionality (#906, #915).
🎉 Welcome New Contributors:
- We're delighted to welcome our new contributors: @xu-song, @x22x22, @yuantao2108, and @fanqiNO1. Your contributions are invaluable to the growth of OpenCompass!
🔗 Full Changelog
- Support LightllmApi input_format by @helloyongyang in #888
- [Fix] rename qwen2-beta -> qwen1.5 by @Leymore in #894
- [Fix] Fix chatglm2 config by @Leymore in #893
- [Fix] Fix moss template config by @xu-song in #897
- Support lmdeploy pytorch engine by @RunningLeon in #875
- [Fix] fix ifeval by @bittersweet1999 in #906
- [Fix] fix ifeval by @jingmingzhuo in #909
- [Fix] Fix type hint in IFEval for python<=3.8 by @Leymore in #915
- [Docs] Update dataset pack urls by @Leymore in #922
- [Sync] update github blacklist by @Leymore in #929
- [Feature] add support for gemini by @bittersweet1999 in #931
- [Feature] Support OpenFinData by @Skyfall-xzz in #896
- [Fix]Fixed the problem of never entering task.run() mode in local scheduling mode. by @x22x22 in #930
- Add VLLM Model Configs by @DseidLi in #938
- [Feature] Upgrade the needle-in-a-haystack experiment to Needlebench by @DseidLi in #913
- [Feature] add lveval benchmark by @yuantao2108 in #914
- [Sync] Sync with internal 2023.03.04 by @Leymore in #941
- [Fix] fix a bug of humanevalplus config by @jingmingzhuo in #944
- [Feature] Add configs of deepseek-coder by @jingmingzhuo in #943
- Fix FinanceIQ_datasets import error by @xu-song in #939
- [Docs] Update rank link in README by @fanqiNO1 in #911
- Support get_ppl for TurbomindModel by @RunningLeon in #878
- Support prompt template for LightllmApi. Update LightllmApi token bucket. by @helloyongyang in #945
- Fix LightllmApi ppl test by @helloyongyang in #951
- [Fix] Chinese version of ReadTheDoc by @tonysy in #947
- [fix] add different temp for different question in mtbench by @bittersweet1999 in #954
- [Sync] Sync with internal codes 2024.03.08 by @Leymore in #953
- [Docs] Update README by @tonysy in #956
- [Misc] Update owners by @Leymore in #961
- [Fix] Use logger.error on failure by @Leymore in #960
- [Sync] Bump version 0.2.3 by @Leymore in #957
For a detailed overview of all changes, check out our Full Changelog.
OpenCompass v0.2.2
Welcome to OpenCompass v0.2.2, a release brimming with new features, essential fixes, and significant improvements across the board. With a focus on enhancing functionality and expanding dataset support, this update underscores our commitment to providing a robust platform for our users.
🌟 Highlights:
- Broadened Dataset Support: Introduction of diverse datasets such as T-Eval, CIBench, IFEval, and NPHardEval, broadening the horizons for research and evaluation.
- API Integrations and Updates: New support for APIs like Nanbeige and updates to existing ones such as Zhipu and Sensetime, enhancing model interaction capabilities.
- Dataset Collection Release: An integrated dataset collection is available in 0.2.2.rc1. The datasets used in the OpenCompass 2.0 leaderboard are NOT included in this collection.
Dive into what's new and improved:
🌟 New Features:
- 📦 Datasets Expansion:
- 🛠 API and Model Enhancements:
- 📖 Documentation and CI Enhancements:
🐛 Bug Fixes:
- Various fixes have been applied to address issues across datasets, evaluators, and configurations, ensuring a smoother experience for all users (#787, #788, #789).
🎉 Welcome New Contributors:
- We're excited to welcome our new contributors: @notoschord, @zhulinJulia24, @QipengGuo, @RangiLyu, @del-zhenwu, and @hailsham. Thank you for your valuable contributions!
🔗 Full Changelog
- Dev by @xmshi-trio in #779
- [Fix] add temperature in alles by @bittersweet1999 in #787
- [Feature] Add support of Nanbeige API by @notoschord in #786
- [Fix] Update gsm8k agent prompt by @tonysy in #788
- [Fix] hot fix for requirements by @yingfhu in #789
- [Feature] Add configs for creationbench by @bittersweet1999 in #791
- Add test runner, one case, daily and pr trigger by @zhulinJulia24 in #751
- [Fix] reorganize subject files by @bittersweet1999 in #801
- Update evaluate turbomind by @RunningLeon in #804
- Added support for multi-needle testing in needle-in-a-haystack test by @DseidLi in #802
- [Sync] Add InternLM2 Keyset Evaluation Demo by @Leymore in #807
- [Doc] Update news by @Leymore in #810
- Fix turbomind and update docs by @RunningLeon in #808
- fix configs template for yi_6b_200k model by @DseidLi in #815
- Test runner update - split step, change schedule time and disable hf cache by @zhulinJulia24 in #814
- Add LightllmApi KeyError log & Update doc by @helloyongyang in #816
- Update cdme config and evaluator by @QipengGuo in #812
- Update hf_internlm2_chat template by @RangiLyu in #823
- [Feature] add Compass arena by @bittersweet1999 in #828
- [Fix] fix strings by @bittersweet1999 in #833
- [Feature] Add IFEval by @jingmingzhuo in #813
- [Feature] add mtbench by @bittersweet1999 in #829
- [Feature] Update API implementation by @tonysy in #834
- [Doc] Update FAQ & Contribution Guide by @Leymore in #830
- add fail notify by @zhulinJulia24 in #836
- [Sync] Update dataset cfg for InternMath by @Leymore in #837
- [Fix] fix corev2 by @bittersweet1999 in #838
- [Feat] minor update agent related by @yingfhu in #839
- [Update] Update Sensetime API by @tonysy in #844
- [Fix] Update MedBench by @xmshi-trio in #845
- [Fix] Fix acc of IFEval by @jingmingzhuo in #849
- [Fix] Update Zhipu API and Fix issue min_out_len issue of API models by @tonysy in #847
- Create link-check.yml by @del-zhenwu in #853
- Update runtime.txt to fix rouge_chinese bugs. by @QipengGuo in #803
- [Fix] fix compass arena by @bittersweet1999 in #854
- add end_str for turbomind by @RunningLeon in #859
- add daily test case by @zhulinJulia24 in #864
- [Feature] support alpacaeval by @bittersweet1999 in #809
- [Fix] Fix error in gsm8k evaluator by @yanyc428 in #782
- [CI] Update github workflow image by @Leymore in #874
- Update daily test by @zhulinJulia24 in #871
- support NPHardEval by @Skyfall-xzz in #835
- [Fix] add do sample demo for subjective dataset by @bittersweet1999 in #873
- [Sync] Sync with internal codes 2024.02.05 by @Leymore in #876
- [Fix] hotfix for mtbench by @bittersweet1999 in #877
- fix lawbench 2-1 f0.5 score calculation bug by @Yggdrasill7D6 in #795
- [feat] support multipl-e by @Connor-Shen in #846
- fix bug of gsm8k_postprocess by @hailsham in #863
- [Feature] add global retriever config by @hailsham in #842
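One of the fixes above (#795) concerns the F0.5 score used in lawbench task 2-1. For reference, the general F-beta score combines precision P and recall R as (1 + β²)·P·R / (β²·P + R), and β = 0.5 weights precision more heavily than recall. The sketch below uses the hypothetical name `f_beta`; it shows the standard definition only and is not the LawBench evaluator code.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# With beta = 0.5, high precision is rewarded more than high recall.
print(f_beta(1.0, 0.5))
print(f_beta(0.5, 1.0))
```

The first call scores higher than the second, which shows the asymmetry that distinguishes F0.5 from the balanced F1 score.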
For a full list of updates, visit our Full Changelog.
Thank you to every contributor, old and new. Your dedication is shaping OpenCompass into a more robust and versatile tool. 🙌 🎉
Remember to star 🌟 our GitHub repository if OpenCompass aids your research and development! Your support and feedback are crucial for our continuous improvement.
OpenCompass v0.2.2.rc1
This release provides more parsed datasets:
OpenCompassData-core-20240207.zip
OpenCompassData-complete-20240207.zip
Important updates compared to the previous version are as follows:
- Subjective: Add AlignBench, MTBench
- Agent: Add T-Eval
- Medicine: Add MedBench
- Code: Add HumanEval-X, DS-1000
- Finance: Add FinanceIQ
- Law: Update LawBench Evaluation Assets
OpenCompassData-core-20240207.zip
AGIEval | ARC | BBH | ceval | CLUE | cmmlu |
commonsenseqa | drop | FewCLUE | flores_first100 | GAOKAO-BENCH | gsm8k |
hellaswag | humaneval | lambada | LCSTS | math | mbpp |
mmlu | nq | openbookqa | piqa | race | siqa |
strategyqa | summedits | SuperGLUE | TheoremQA | triviaqa | tydiqa |
winogrande | xstory_cloze | Xsum |
OpenCompassData-complete-20240207.zip
AGIEval | anli | ARC | BBH | CDME | ceval |
cibench_dataset | cleva | clozeTest-maxmin | CLUE | CMB | cmmlu |
commonsenseqa | commonsenseqa_cn | crowspairs_cn | drop | ds1000_data | FewCLUE |
FinanceIQ | flores200_dataset | flores_first100 | FunctionalMT | game24 | GAOKAO-BENCH |
gpqa | gsm8k | hellaswag | humaneval | humaneval_cn | humaneval_multipl-e |
humanevalx | HungarianExamMath | InfiniteBench | lambada | lanQ | lawbench |
LCSTS | math | math401 | mbpp | mbpp_cn | mbpp_plus |
MedBench | mmlu | MNIST | NPHardEval | nq | nq_cn |
nq-open | openbookqa | piqa | py150 | qabench | race |
scibench | siqa | SQuAD2.0 | strategyqa | alignment_bench | mtbench |
summedits | SuperGLUE | svamp | teval | TheoremQA | triviaqa |
tydiqa | winogrande | xiezhi | xlsum | xstory_cloze | Xsum |