Releases · open-compass/opencompass

08 Jan 14:57

bittersweet1999

0.2.1

a74e4c1

OpenCompass v0.2.1

We're thrilled to announce OpenCompass v0.2.1, loaded with new datasets, features, and vital fixes. This release is a testament to our ongoing commitment to enhancing user experience and broadening research capabilities.

🌟 Highlights:

Add Agent and Code datasets: Diverse new datasets like GPQA, mastermath2024v1, and more, significantly expanding the scope of OpenCompass.
Support Different JudgeLLM Subjective Evaluation: Provides more options when choosing judge LLMs.
Support Needle in Haystack: Adds Needle in a Haystack testing for long-text evaluation.
Add VLLM Evaluation: We support VLLM inference and evaluation.
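
As a rough illustration of the Needle in a Haystack idea mentioned above, the sketch below hides a short "needle" sentence at a chosen depth of a long context and checks whether a model's answer recalls it. The function names `insert_needle` and `needle_recalled` are hypothetical and are not taken from the OpenCompass code base; the actual implementation (#714) may build and score its prompts differently.

```python
def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Insert `needle` at roughly `depth_percent` of `haystack` (hypothetical helper).

    The insertion point is snapped back to the previous sentence boundary
    so the needle does not split an existing sentence.
    """
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be in [0, 100]")
    position = int(len(haystack) * depth_percent / 100)
    if position >= len(haystack):
        # 100% depth: append the needle after the whole context
        return haystack + " " + needle
    # Find the last ". " that lies entirely before `position`
    boundary = haystack.rfind(". ", 0, position)
    cut = boundary + 2 if boundary != -1 else 0
    return haystack[:cut] + needle + " " + haystack[cut:]


def needle_recalled(answer: str, keyword: str) -> bool:
    """Simplified scoring assumption: the answer must contain the keyword."""
    return keyword.lower() in answer.lower()


context = "Alpha one. Beta two. Gamma three. Delta four."
needle = "The secret number is 42."
prompt = insert_needle(context, needle, depth_percent=50)
```

Sweeping `depth_percent` and the context length gives the usual grid of recall results for a long-context model.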

Here's what's new:

🚀 New Features:

📦 Dataset Expansion:
- Added rwkv-5-3b model (#666)
- Integration of diverse datasets including GPQA, Creationbench, and more.
- Support for new datasets like mastermath2024v1, mbpp_plus, and sanitized_mbpp (#744, #770, #745)
🛠 Functional Enhancements:
- Subjective evaluation improvements (#692, #724)
- Updated python action, slurm, and docker docs (#694, #718)
- Turbomind API support and Qwen API integration (#693, #735)
📖 Documentation Updates:
- Updated contamination, alignmentbench, and other docs for better clarity (#698, #707)
- Fixed dead links and typos in various documents (#455, #773, #774)

🐛 Bug Fixes:

Addressed various issues including those in alignmentbench, configs, and postprocess scripts.
Fixed bugs concerning subjective evaluation and EOS string detection.
Quick fixes for improved performance and reliability.

🎉 Welcome New Contributors:

A warm welcome to our first-time contributors:
- @BBuf, @DseidLi, @Skyfall-xzz, @RunningLeon, @zehuichen123, @AllentDan, @Connor-Shen, @Francis-llgg, @hzhwcmhf, @ChrisLiu6, @yanyc428, @tpoisonooo, @jiangjin1999

🔗 Full Changelog

add rwkv-5-3b model by @BBuf in #666
[Feature] Add double order of subjective evaluation and removing duplicated response among two models by @bittersweet1999 in #692
[Feat] update python action and slurm by @yingfhu in #694
[Doc] Update contamination docs by @Leymore in #698
alignmentbench infer and judge by @bittersweet1999 in #697
[Fix] Update alignmentbench by @tonysy in #704
removed redundant code in GSM8KDataset.load method. by @DseidLi in #700
[Fix] fix a bug on configs/eval_mixtral_8x7b.py by @jingmingzhuo in #706
[Doc] Update Doc for Alignbench by @tonysy in #707
[Fix] minor fix openai by @yingfhu in #711
Add Judgellms by @bittersweet1999 in #710
[Feat] Update math/agent by @yingfhu in #716
[Docs] update docker docs by @yingfhu in #718
[Fix] Quick fix for max_out_len in subjective evaluation by @bittersweet1999 in #719
[Feature] Support the use of humaneval_plus. by @jingmingzhuo in #720
[Feature] Add reasonbench dataset by @Skyfall-xzz in #577
[Feature] Add abbr for judgemodel in subjective evaluation by @bittersweet1999 in #724
Update configs for evaluating chat models like qwen, baichuan, llama2 using turbomind backend by @RunningLeon in #721
[News] add news for T-Eval by @zehuichen123 in #727
Add NeedleInAHaystack Test Support by @DseidLi in #714
[Fix] Fixed abbr erro of subjective alignbench and size partition by @bittersweet1999 in #730
add turbomind restful api support by @AllentDan in #693
[Fix] Update merge script for non-split settting by @tonysy in #733
[Sync] Sync with internal codes by @Leymore in #734
[Feature] Add InfiniteBench by @philipwangOvO in #739
Update LightllmApi and Fix mmlu bug by @helloyongyang in #738
[Feature] Add other judgelm prompts for Alignbench by @bittersweet1999 in #731
[Feat] support sanitized mbpp dataset by @yingfhu in #745
[Fix] SubSizePartition fix by @bittersweet1999 in #746
add chinese version of humaneval, mbpp by @Connor-Shen in #743
[Fix] fix erro in configs by @bittersweet1999 in #750
[Feature] Add Creationbench Dataset by @bittersweet1999 in #753
[Feat] update code config by @yingfhu in #749
update plot function in tools_needleinahaystack.py by @DseidLi in #747
[Feature] Add new dataset mastermath2024v1 by @Francis-llgg in #744
[Feature] Add GPQA Dataset by @Francis-llgg in #729
change NeedleInAHaystackDataset to dynamic loading by @DseidLi in #754
[Feature] Add support of Qwen API by @hzhwcmhf in #735
[Feature] Support LLaMA2-Accessory by @ChrisLiu6 in #732
[Fix] Fix small bug in alignbench by @bittersweet1999 in #764
[Feature] Add multi_round dataset evaluation by @bittersweet1999 in #766
[Feature] add subject ir dataset by @bittersweet1999 in #755
[Update] Update introduction of CompassBench-2024-Q1 by @tonysy in #769
[Fix] quick fix for postprocess by @bittersweet1999 in #771
Support Mbpp_plus dataset by @Connor-Shen in #770
[Fix] fix typos in drop prompt by @yanyc428 in #773
typo(installation.md): fix unzip commands by @tpoisonooo in #774
Contamination analysis for MMLU, Hellaswag, and ARC_c by @liyucheng09 in #699
[Docs] Update contamination docs by @Leymore in #775
[Feature] _batch_generate function, add the MultiTokenEOSCriteria by @jiangjin1999 in #772
[Sync] Sync with internal codes 2023.01.08 by @Leymore in #777

For a full list of updates, visit our Full Changelog.

Thank you to every contributor, old and new. Your dedication is shaping OpenCompass into a more robust and versatile tool. 🙌 🎉

Remember to star 🌟 our GitHub repository if OpenCompass aids your research and development! Your support and feedback are crucial for our continuous improvement.

Contributors

hzhwcmhf, tpoisonooo, and 19 other contributors


12 Dec 06:42

yingfhu

0.2.0

4780b39

OpenCompass v0.2.0

🌟 Highlights

🛠 Data Contamination Analysis: A novel feature for analyzing and ensuring the integrity of dataset inputs.
🧠 Enhanced Subjective Evaluation: Implementation of a new subjective judgement system, providing more nuanced and accurate evaluations.
🚀 Chat Style Inferencer Support: Introduction of a new chat style inferencer, enhancing interactive capabilities.
🌐 Multilingual Features: Expansion to support Chinese versions of commonsenseqa, crowspairs, and nq datasets.
📊 New Datasets Integration: Addition of wikibench, rolebench, and updated versions of gsm8k and MathBench datasets for broader research applications.
🛠 Enhancements and Bug Fixes: Numerous improvements including a new subjective judgement system and updates in MathBench CodeInterpreter.
📝 Documentation and API Updates: Comprehensive updates to README and API interfaces for better user guidance and experience.

🚀 New Features & Enhancements

Support for chat style inferencer, offering a more dynamic interaction model (#643).
Addition of Chinese versions for key datasets: commonsenseqa, crowspairs, and nq (#144).
Introduction of the wikibench dataset, providing a new benchmark for knowledge-based tasks (#655).
Updated gsm8k and MathBench configurations for enhanced performance and accuracy (#652, #657).
Addition of rolebench dataset, expanding the range of evaluative scenarios (#633).
Implementation of new subjective judgement criteria for improved assessment accuracy (#660).
Integration of advanced models like qwen-1.8b/72b and deepseek-7b/67b in the platform's configuration (#672).
Launch of Data Contamination Analysis as a new feature, enhancing data integrity checks (#639).

🛠 Improvements & Fixes

Removal of colossalai dependency to streamline operations (#645).
Resolution of various bugs including hellaswag_ppl_47bff9 and standard deviation summarizer issues (#648, #675).
Update and fix of the MathBench CodeInterpreter and related bugs (#657).
Enhancement of API interface for improved functionality and user experience (#681).

📚 Documentation Updates

Updated README for clearer guidance and information (#682).
Documentation and docstring updates for accuracy and comprehensiveness (#684).

🎊 New Contributors

A warm welcome to new contributors @rolellm, @liyucheng09, and @xmshi-trio. Your contributions have significantly enriched OpenCompass!

🔗 Full Changelog

[Fix] remove colossalai dependency by @yingfhu in #645
[Fix] Fix hellaswag_ppl_47bff9 by @Leymore in #648
[Feature] Support chat style inferencer. by @mzr1996 in #643
[Feature] Add Chinese version: commonsenseqa, crowspairs and nq by @liushz in #144
[Feature] Add wikibench dataset by @liushz in #655
[Feat] update gsm8k and math agent config by @yingfhu in #652
[Feature] Update MathBench CodeInterpreter & fix MathBench Bug by @liushz in #657
added rolebench dataset. by @rolellm in #633
New subjective judgement by @bittersweet1999 in #660
[Feature] Add qwen-1.8b/72b and deepseek-7b/67b configs by @Leymore in #672
Add Data Contamination Analysis [New Feature] by @liyucheng09 in #639
[Fix] fix bug on standart_deviation summarizer by @jingmingzhuo in #675
update medbench by @xmshi-trio in #678
[Enhancement] Update API Interface by @tonysy in #681
[Doc] Update README by @kennymckormick in #682
[Feat] support pr merge test ci by @yingfhu in #669
[Feature] enhance the ability of humaneval_postprocess by @jingmingzhuo in #676
[Sync] Update codes by @yingfhu in #683
[Docs] fix docstring by @yingfhu in #684
new version of subject by @bittersweet1999 in #680
fixed small problem of new version subject evaluation by @bittersweet1999 in #686
[Sync] bump version to 0.2.0 by @yingfhu in #690

Explore the detailed changes and contributions in the full changelog: OpenCompass Changelog.

Thank you to all contributors for your hard work and dedication. OpenCompass v0.2.0 marks another step forward in our journey, bringing enhanced features and capabilities to the community. Let's continue to innovate and expand the horizons of OpenCompass! 🎉🌐💡

Contributors

tonysy, Leymore, and 9 other contributors


28 Nov 03:53

Leymore

0.1.9

e20d654

OpenCompass v0.1.9

🌟 Highlights

🚀 New API Integrations: A leap forward with the addition of multiple new APIs, including Baidu, Moonshot, Sensetime, and more, broadening the scope and capabilities of OpenCompass.
🔵 Circular Evaluation Feature: Introducing Circular Eval, an enhancement for comprehensive and dynamic evaluations within the platform.
🤖 Turbomind Inference Integration: Integration of Turbomind inference through its RPC API, enhancing the platform's inferencing capabilities.

🚀 New Features & Enhancements

Model & API Development: Explore new capabilities with DataCanvas Alaya LM, Lightllm API, 360API, and enhanced Turbomind Python API integration (#612, #613, #601, #484).
Circular Evaluation Implementation: Elevate your evaluation methods with the newly added Circular Eval feature, offering a more nuanced and detailed analysis capability (#610).
Rich Dataset Additions: Enrich your research with new datasets - FinanceIQ, SVAMP, GSM_Hard, and updated Mathbench for diverse applications (#596, #604, #619, #580, #607).
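
A minimal sketch of how a circular evaluation of a multiple-choice question can work: the options are rotated through every circular shift, and the question counts as correct only if the model picks the right option in every rotation. This is an illustration under that assumption; the names `circular_variants` and `circular_correct` are hypothetical, and the Circular Eval added in #610 may define its variants and metrics differently.

```python
from typing import Callable, Iterator, List, Tuple


def circular_variants(options: List[str], answer_idx: int) -> Iterator[Tuple[List[str], int]]:
    """Yield (rotated_options, rotated_answer_idx) for every circular shift."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        # The correct option moves `shift` places to the left (wrapping around)
        yield rotated, (answer_idx - shift) % n


def circular_correct(
    question: str,
    options: List[str],
    answer_idx: int,
    predict: Callable[[str, List[str]], int],
) -> bool:
    """True only if `predict` returns the right index for all rotations."""
    return all(
        predict(question, opts) == idx
        for opts, idx in circular_variants(options, answer_idx)
    )


# Toy predictors standing in for a model (assumptions for illustration only)
def knows_answer(question: str, opts: List[str]) -> int:
    return opts.index("Paris")


def always_first(question: str, opts: List[str]) -> int:
    return 0


opts = ["London", "Paris", "Rome", "Berlin"]
```

A predictor that always chooses the first option is right on one rotation only, so it is rejected, which is the position bias this style of evaluation is meant to expose.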

🛠 Improvements & Fixes

Subjective Evaluation Bug Fixes: Improved accuracy in subjective evaluations (#589).
Dataset and Feature Fixes: Resolving issues in CMB dataset, various feature enhancements, and fixes (#587, #592, #615, #632).

📚 Documentation Updates

README & FAQ Enhancements: Updated for better clarity and assistance (#582, #622, #628, #629).
Typo and Spelling Corrections: Ensuring accuracy and professionalism in documentation (#594, #637).

🎊 New Contributors

Welcoming new contributors to the OpenCompass family!

@rahidzeynal, @Sniper970119, @ZhangRaymond, @HunterKruger, @helloyongyang, and @Yggdrasill7D6. Your contributions are greatly appreciated!

What's Changed

Add author as: author='OpenCompass Contributors' by @rahidzeynal in #578
[Doc] Update README by @tonysy in #582
[Feature] Update mathbench by @tonysy in #580
Fix bugs in subjective evaluation by @frankweijue in #589
[Fix] fix cmb dataset by @Leymore in #587
[Fix] change save_every defaults to 1 by @yingfhu in #592
update word spell by @Sniper970119 in #594
Add FinanceIQ dataset by @ZhangRaymond in #596
[Feat] support humaneval and mbpp pass@k by @yingfhu in #598
[Feature] Add multi-prompt generation demo by @jingmingzhuo in #568
Mathbench update postprocess by @liushz in #600
[Feature] Add arithmetic to mathbench by @liushz in #607
Add support for DataCanvas Alaya LM by @HunterKruger in #612
[Feature] Support Lightllm api by @helloyongyang in #613
[Feature] Support 360API and FixKRetriever for CSQA dataset by @tonysy in #601
Integrate turbomind python api by @lvhan028 in #484
[Bug] Update api with generation_kargs by @tonysy in #614
[Fix] Fix gen inferencer by @Leymore in #615
[Docs] update ds1000 code eval docs by @yingfhu in #618
[Feature] Add SVAMP dataset by @liushz in #604
[Feature] support download from modelscope by @KevinNuNu in #534
[Doc] Update README and requirements. by @tonysy in #622
[Sync] Fix cmnli, fix vicuna meta template, fix longbench postprocess and other minor fixes by @Leymore in #625
[API] Update API by @tonysy in #624
[Feature] Add circular eval by @Leymore in #610
[Doc] Update FAQ by @Leymore in #628
[Doc] Update README by @tonysy in #629
[Bug] fix icl eval with nested list by @yingfhu in #632
Fix LightllmAPI list bug by @helloyongyang in #635
fix typo in README by @Yggdrasill7D6 in #637
[Sync] update codes by @Leymore in #641
[Feature] Add GSM_Hard dataset by @liushz in #619
[Feat] support zhipu post process by @yingfhu in #642
[Sync] Bump version to 0.1.9 by @Leymore in #644
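
For readers unfamiliar with the pass@k metric referenced in #598, the sketch below uses the widely cited unbiased estimator: with n samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). Whether OpenCompass computes it in exactly this form is an assumption, and `pass_at_k` is a name chosen for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that pass, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 of them correct
estimate = pass_at_k(n=10, c=3, k=1)
```

The dataset-level score is the mean of this value over all problems.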

Explore the detailed changes in the full changelog.

Thank you to all the contributors for this release. Your dedication and hard work continue to enhance OpenCompass, making it an ever-evolving and dynamic tool for the community. Let's dive into the new possibilities with OpenCompass v0.1.9! 🎉🧮💻

Contributors

lvhan028, tonysy, and 12 other contributors


13 Nov 08:43

Leymore

0.1.8

1ea88d5

OpenCompass v0.1.8

🔥 Highlights

🌐 New Dataset Integrations: Expanding our dataset collection with Tabmwp, py150, maxmin, and more.
💡 Compatibility and API Support: Enhancements with MiniGPT-4 and MiniMax API, and support for Xunfei API.
🛠️ Local Environment and Debugging Improvements: Streamlined local debugging and usage of datasets from local paths.

🚀 New Features & Enhancements

Datasets Galore: Unleash the power of new datasets including Tabmwp, py150, maxmin, and updates to existing ones like Mathbench for broader research scope (#505, #546, #562).
MiniGPT-4 & MiniMax API Compatibility: Stay up-to-date with the latest versions and extended API support (#539, #548).
Xunfei API Model & Update: Explore new possibilities with the integration and update of Xunfei API (#547, #572).

🛠 Improvements & Fixes

Local Debug Mode Restriction: Enhanced resource management in local debug mode (#522 by @yingfhu).
Various Fixes and Updates: Addressing typos, import issues, and log redirections for smoother operation (#520, #549, #551, #555, #564).

📚 Documentation Updates

Enhanced README and FAQs: Get all your queries answered and understand OpenCompass better with updated documentation (#523, #531, #535, #540, #567).
Typo Corrections: Ensuring clarity and accuracy in our documentation (#530, #533).

🎊 New Contributors

A warm welcome to the new members of the OpenCompass community!

@Sanster, @ayushrakesh, @HimanshuMahto, @shresthasurav, @bittersweet1999, and @jingmingzhuo. Thank you for your valuable contributions!

Changelog

add multi model viz by @Sanster in #509
fix typo in WSC prompt by @Sanster in #520
[Fix] fix local debug mode not restrict the resources by @yingfhu in #522
Update README.md - one enhancement. by @ayushrakesh in #523
Typo error in README.md by @HimanshuMahto in #531
docs: fix typos in markdown files by @shresthasurav in #530
[Doc] Update README and FAQ by @tonysy in #535
[fFeat] Add an opensource dataset Tabmwp by @bittersweet1999 in #505
[Feature]: To be compatible with the latest version of MiniGPT-4 by @YuanLiuuuuuu in #539
[Doc] Update README by @tonysy in #540
[Feat] support xunfei api model by @yingfhu in #547
[Feature] Add support for MiniMax API by @tonysy in #548
【Feature】Update Mathbench dataset prompt and fix small errors by @liushz in #546
[Fix] fix filename typo by @yingfhu in #549
[Feat] support cidataset by @yingfhu in #538
[Fix] fix registry error with internal by @yingfhu in #551
[Fix] fix unnecessary import and update requirements by @yingfhu in #555
[Fix] fix log re-direct by @yingfhu in #564
Add py150 and maxmin by @jingmingzhuo in #562
[Doc] Update api.txt by @tonysy in #567
[Docs] add humanevalx dataset link in config by @yingfhu in #559
[Docs] fix GLUE_CoLA dataset name error by @KevinNuNu in #533
[Feature] Update xunfei api by @tonysy in #572
[Feature] Add CMB zero-shot evaluation by @Leymore in #571
[Feature] Use dataset in local path by @Leymore in #570
[Sync] update model configs by @Leymore in #574
[Sync] Bump version to 0.1.8 by @Leymore in #576

Explore the detailed changes in the full changelog.

Thank you to everyone who contributed to this release. Your efforts are immensely appreciated and are helping to make OpenCompass a more robust and versatile tool. Let's continue to push the boundaries with OpenCompass v0.1.8! 🚀🌐🛠️

Contributors

Sanster, tonysy, and 10 other contributors


10 Nov 11:05

Leymore

0.1.8.rc1

7f77e8d

OpenCompass v0.1.8.rc1 Pre-release


Provides more parsed datasets:

OpenCompassData-core.zip

AGIEval	ARC	BBH	ceval	CLUE	cmmlu
commonsenseqa	drop	FewCLUE	flores_first100	GAOKAO-BENCH	gsm8k
hellaswag	humaneval	lambada	LCSTS	math	mbpp
mmlu	nq	openbookqa	piqa	race	siqa
strategyqa	summedits	SuperGLUE	TheoremQA	triviaqa	tydiqa
winogrande	xstory_cloze	Xsum

OpenCompassData-complete.zip

AGIEval	anli	ARC	BBH	ceval	cleva
CLUE	CMB	cmmlu	commonsenseqa	drop	ds1000
FewCLUE	flores200_dataset	flores_first100	game24	GAOKAO-BENCH	govrep
gsm8k	hellaswag	humaneval	jigsawmultilingual	lambada	lawbench
LCSTS	math	mbpp	mmlu	narrativeqa	nq
openbookqa	piqa	QASPER	race	realtoxicprompts	scibench
siqa	SQuAD2.0	strategyqa	summedits	SummScreen	SuperGLUE
TheoremQA	triviaqa	triviaqa-rc	tydiqa	winogrande	xiezhi
xlsum	xstory_cloze	Xsum	FinanceIQ


27 Oct 15:49

Leymore

0.1.7

6a398d1

OpenCompass v0.1.7

🌟 Highlights

Sampling Control: Enforce do_sample=False for precise control over sampling behavior in HF model.
Subjective Evaluation Guidance: Enhanced evaluation mechanisms for a more comprehensive understanding and analysis of models.
Eval Details Dump: Now, evaluation details for certain datasets are available for a deeper insight and analysis.

🚀 New Features

Eval Details Dump for deeper insight into each test case (#517 by @Leymore).
MathBench Dataset and Circular Evaluator to bolster mathematical benchmarking capabilities (#408 by @liushz).
Support for Math/GSM8k Agent Config providing new avenues for configuration (#494 by @yingfhu).
Default Example Summarizer making summarization tasks more accessible (#508 by @Leymore).
Model Keyword Arguments Setting for HF Model enhancing customization (#507 by @Leymore).

🛠 Improvements & Refactorings

Local API Speed Up with fixed concurrent users for better performance (#497 by @yingfhu).
Local Runner Support for Windows expanding the platform support (#515 by @yingfhu).
Sync with Internal Implementation for updated and refined functionalities (#488 by @Leymore).

🐛 Bug Fixes

Summary Default Fix for accurate summarization (#483 by @Leymore).
Enforce do_sample=False in HF Model for correct sampling behavior (#506 by @Leymore).
Invalid Link Fix in documentation for better navigation (#499 by @yingfhu).
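
To illustrate why enforcing do_sample=False matters for evaluation, the toy sketch below contrasts greedy decoding (always the most probable token, hence reproducible) with sampling. `do_sample` is a real generation option in Hugging Face Transformers, but `next_token` and `force_greedy` are hypothetical helpers written for this example and do not mirror the change in #506.

```python
import random
from typing import Dict, Optional


def next_token(probs: Dict[str, float], do_sample: bool,
               rng: Optional[random.Random] = None) -> str:
    """Pick the next token from a toy probability table."""
    if not do_sample:
        # Greedy decoding: deterministic argmax
        return max(probs, key=probs.get)
    rng = rng or random.Random()
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


def force_greedy(generation_kwargs: dict) -> dict:
    """Return a copy of the kwargs with sampling switched off (hypothetical helper)."""
    merged = dict(generation_kwargs)
    merged["do_sample"] = False
    return merged


probs = {"yes": 0.6, "no": 0.4}
greedy_runs = {next_token(probs, do_sample=False) for _ in range(20)}
kwargs = force_greedy({"do_sample": True, "max_new_tokens": 8})
```

With sampling disabled, repeated runs of the same prompt produce the same output, so benchmark scores do not drift between runs because of decoding randomness.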

📚 Documentation & Maintenance

Subjective Comparison Introduction for organized documentation (#510 by @frankweijue).
README Update for better project understanding (#496 by @saakshii12).
Owner Update for correct ownership information (#504 by @Leymore).

🎊 New Contributors

We're delighted to welcome new contributors to the OpenCompass community!

@saakshii12 made their first contribution in #496.
@frankweijue stepped in with their first contribution in #510.

Changelog

[Fix] Fix summary default by @Leymore in #483
[Feature] Add mathbench dataset and circular evaluator by @liushz in #408
[Sync] sync with internal implements by @Leymore in #488
Update README.md by @saakshii12 in #496
[Docs] update invalid link in docs by @yingfhu in #499
[Feat] local api speed up with fixed concurrent users by @yingfhu in #497
[Feat] support math/gms8k agent config by @yingfhu in #494
[Feature] update .owner by @Leymore in #504
[Feat] use example summarizer by default by @Leymore in #508
[Feat] Add _set_model_kwargs_torch_dtype for HF model by @Leymore in #507
Subdocs by @frankweijue in #510
[Fix] enforce do_sample=False in HF model by @Leymore in #506
[Feat] support local runner for windows by @yingfhu in #515
[Sync] Sync with internal codes 2023.10.27 by @Leymore in #517
Bump version to 0.1.7 by @Leymore in #518

The full list of changes is available in the changelog. A massive thank you to all the community members who contributed to this release. Your efforts are propelling OpenCompass further! 🙌

Embark on new explorations with OpenCompass v0.1.7!

Contributors

Leymore, frankweijue, and 3 other contributors


13 Oct 12:16

Leymore

0.1.6

6317da0

OpenCompass v0.1.6

Welcome to the newest version of OpenCompass! v0.1.6 brings forth exciting dataset additions, crucial fixes, and enhanced documentation. We're confident that this release will provide a better and smoother experience for all users.

🆕 Highlights:

Dataset Enrichment: Multiple additions, especially from the GLUE suite, to provide more versatility and better testing capabilities.
Documentation Revamp: Fixed dead links and updated the 'get_started' section to assist our users in navigating OpenCompass effortlessly.
Introducing New Faces: A warm welcome to our newest contributors. Your dedication and contributions are pivotal to our progress!

Dive into the details:

🌟 New Features:

📦 Datasets Galore:
- Introduced WikiText-2&103 dataset (#397)
- GLUE dataset additions:
  - CoLA (#406)
  - QQP (#438)
  - MRPC (#440)
- Lawbench dataset addition (#460)
🛠 Utilities and Enhancements:
- Re-implementation of ceval load dataset (#446)
- Integrated turbomind inference through its RPC API (#414)
- Moved fix_id_list to Retriever for better code organization (#442)
📖 Documentation and Syncs:
- Updated dataset list and get_started section (#437, #435)
- Resolved dead links in the readme (#455)
- Enhancements to LongEval and subjective evaluation (#443, #475)

🐛 Bug Fixes:

Addressed issues related to clp errors and support for bs>1 (#439)
Resolved issues concerning jieba rouge (#459, #467)
Enhanced EOS string detection for splitting (#477)
Various other fixes for optimal performance.
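
The EOS fix above (#477, "split if and only if complete eos string shows up") can be pictured with the sketch below: the generated text is cut only where a full end-of-sequence string appears, and a partial match is left untouched. `truncate_at_eos` is a hypothetical name, and this is a simplified illustration rather than the code from the pull request.

```python
from typing import List


def truncate_at_eos(text: str, eos_strings: List[str]) -> str:
    """Cut `text` at the earliest complete occurrence of any EOS string."""
    cut = len(text)
    for eos in eos_strings:
        if not eos:
            continue  # an empty marker would match everywhere; ignore it
        idx = text.find(eos)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


complete = truncate_at_eos("Answer: 4</s>trailing text", ["</s>"])
partial = truncate_at_eos("Answer: 4</", ["</s>"])
```

Here `complete` is trimmed to "Answer: 4", while `partial` keeps its dangling "</" because the full marker never appeared.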

🎉 Welcome New Contributors:

A big shout-out to our new contributors:
- @KevinNuNu (First PR)
- @lvhan028 (First PR)

Huge thanks to all contributors! Your constant efforts make OpenCompass better with each release. 🙌 🎉

Changelog

[SIG] add WikiText-2&103 by @KevinNuNu in #397
[SIG] add GLUE_CoLA dataset by @KevinNuNu in #406
[SIG] add GLUE QQP dataset by @KevinNuNu in #438
[SIG] add GLUE_MRPC dataset by @KevinNuNu in #440
[Doc] Update dataset list by @Leymore in #437
[Fix] use eval field check by @Leymore in #441
[Sync] Update LongEval by @philipwangOvO in #443
[Fix] fix clp potential error and support bs>1 by @yingfhu in #439
[Feature] re-implement ceval load dataset by @Leymore in #446
Integrate turbomind inference via its RPC API instead of its python API by @lvhan028 in #414
[Docs] update get_started by @gaotongxiao in #435
[Refactor] Move fix_id_list to Retriever by @gaotongxiao in #442
[Docs] Fix dead links in readme by @gaotongxiao in #455
[Fix] Use jieba rouge in lcsts by @Leymore in #459
[Fix] Fix jieba rouge with empty string by @Leymore in #467
[Sync] Add subjective evaluation by @Leymore in #475
[Feature] Add lawbench by @Leymore in #460
[Fix] split if and only if complete eos string shows up by @Leymore in #477
Bump version to 0.1.6 by @Leymore in #478

For a detailed overview, check out our Full Changelog.

If you find OpenCompass beneficial, kindly star 🌟 our GitHub repository! We value your feedback, reviews, and continued support.

Contributors

lvhan028, Leymore, and 4 other contributors


22 Sep 11:25

gaotongxiao

0.1.5

9b21613

OpenCompass v0.1.5

Dive into our newly improved features, bug fixes, and most notably our enhanced dataset support, coming together to refine your experience.

🆕 Highlights:

Boosted Dataset Integrations: This release adds support for numerous datasets such as ds1000, promptbench, Anthropic evals, kaoshi, and many more, making OpenCompass more versatile than ever.
More Evaluation Types: We have started integrating subjective and agent-aided LLM evaluation into OpenCompass. Stay tuned!

Explore the detailed changes:

🌟 New Features:

📦 New Datasets and Features:
- ds1000 dataset support (#395)
- promptbench dataset implementation (#239)
- Anthropic evals dataset support (#422)
- kaoshi dataset introduction (#392)
- Initial support for subjective evaluation (#421)
- Support for GSM8k evaluation tools (#277)
- scibench evaluation added (#393)

📖 Documentation:

News updates and introduction figure in README (#375, #413)
Updated get_started.md and fixed naming issues (#377, #380)
New FAQ section added (#384)
README addition in longeval (#389)
Multimodal documentation introduced (#334)

🛠️ Bug Fixes:

Addressed a potential OOM issue (#387)
Added has_image fix to scienceqa (#391)
Resolved performance issues of visualglm (#424)
Debug logger fix for summarizer (#417)
Addressed errors in keep keys (#431)

⚙ Enhancements and Refactors:

Refinement in docs and codes for better user guidance (#409)
Custom summarizer argument added in CLI mode (#411)
mPLUG-Owl and LLaMA-Adapter support introduced (#405)
Enhanced mm models support on public datasets (#412)
Customized config path support (#423)

🎉 New Contributors:

A heartfelt welcome to our first-time contributors:

@wangxidong06 (First PR)
@so2liu (First PR)
@HoBeedzc (First PR)
@CuteyThyme (First PR)
@chenbohua3 (First PR)

To all contributors, old and new, thank you for continually enhancing OpenCompass! Your efforts are deeply valued. 🙌 🎉

If you love OpenCompass, don't forget to star 🌟 our GitHub repository! Your feedback, reviews, and contributions immensely help in shaping the product.

Changelog

[Doc] Update News by @tonysy in #375
Update get_started.md by @liushz in #377
[CI] Publish to Pypi by @gaotongxiao in #366
[Docs] Fix incorrect name in get_started by @gaotongxiao in #380
fix potential OOM issue by @cdpath in #387
[Docs] Add FAQ by @gaotongxiao in #384
Add CMB by @wangxidong06 in #376
[Fix]: Add has_image to scienceqa by @YuanLiuuuuuu in #391
[Feat] support ds1000 dataset by @yingfhu in #395
[Feat] implementation for support promptbench by @yingfhu in #239
[Feat] refine docs and codes for more user guides by @yingfhu in #409
[Docs] Readme in longeval by @philipwangOvO in #389
feat: add custom summarizer argument in CLI run mode by @so2liu in #411
Yhzhang/add mlugowl llamaadapter by @ZhangYuanhan-AI in #405
[Feat] Support mm models on public dataset and fix several issues. by @yyk-wew in #412
[Docs] Add intro figure to README by @gaotongxiao in #413
[fix] summarizer debug logger by @HoBeedzc in #417
[Doc] Update news by @Leymore in #420
[Feature] Use local accuracy from hf implements by @Leymore in #416
[Feat] support antropics evals dataset by @yingfhu in #422
[Fix] Fix performance issue of visualglm. by @yyk-wew in #424
[Feature] Log gold answer in prediction output by @gaotongxiao in #419
Support GSM8k evaluation with tools by Lagent and LangChain by @mzr1996 in #277
[Sync] Initial support of subjective evaluation by @gaotongxiao in #421
[Fix] P0: errors in keep keys by @gaotongxiao in #431
add evaluation of scibench by @CuteyThyme in #393
[Feature] Add kaoshi dataset by @liushz in #392
[Docs] Add multimodal docs by @fangyixiao18 in #334
support customize config path by @chenbohua3 in #423

Full Changelog: 0.1.4...0.1.5

Contributors

tonysy, so2liu, and 15 other contributors


08 Sep 13:18

gaotongxiao

0.1.4

c7a8b8f

OpenCompass v0.1.4

OpenCompass v0.1.4 is here with an array of features, documentation improvements, and key fixes! Dive in to see what's in store:

🆕 Highlights:

More Tools and Features: OpenCompass continues to expand its repertoire with the update suffix tool, codellama, preds collection tools, qwen & qwen-chat support, and more, along with Otter support in the MMBench evaluation.
Documentation Facelift: We've made several updates to our documentation, ensuring it stays relevant, user-friendly, and aesthetically pleasing.
Essential Bug Fixes: We’ve tackled numerous bugs, especially those concerning tokens, triviaqa, nq postprocess, and qwen config.
Enhancements: From simplifying execution logic to suppressing warnings, we’re always on the lookout for ways to improve our product.

Dive deeper to learn more:

🌟 New Features:

📦 Tools and Integrations:

Application of update suffix tool (#280).
Support for codellama and preds collection tools (#335).
Addition of qwen & qwen-chat support (#286).
Introduction of Otter to OpenCompass MMBench Evaluation (#232).
Support for LLaVA and mPLUG-Owl (#331).

🛠 Utilities and Functionality:

Enhanced sample count in prompt_viewer (#273).
Ignored ZeroRetriever error when id_list provided (#340).
Increased default task size (#360).

📝 Documentation:

Updated communication channels: WeChat and Discord (#328).
Documentation theme revamped for a fresh look (#332).
Detailed documentation for the new entry script (#246).
MMBench documentation updated (#336).

🛠️ Bug Fixes:

Resolved issue when missing both pad and eos token (#287).
Addressed triviaqa & nq postprocess glitches (#350).
Fixed qwen configuration inaccuracies (#358).
Default value added for zero retriever (#361).

⚙ Enhancements and Refactors:

Streamlined execution logic in run.py and ensured temp files cleanup (#337).
Suppressed unnecessary warnings raised by get_logger (#353).
Import checks of multimodal added (#352).

🎉 New Contributors:

Thank you to all our contributors for this release, with a special shoutout to our new contributors:

@Luodian (First PR)
@ZhangYuanhan-AI (First PR)
@HAOCHENYE (First PR)

Thank you to the entire community for pushing OpenCompass forward. Make sure to star 🌟 our GitHub repository if OpenCompass aids your endeavors! We treasure your feedback and contributions.

Changelog

[Feature] Add and apply update suffix tool by @Leymore in #280
support sample count in prompt_viewer by @cdpath in #273
docs: update wechat and discord by @vansin in #328
[Docs] Update doc theme by @gaotongxiao in #332
[Feat] support codellama and preds collection tools by @yingfhu in #335
[Feature] Add qwen & qwen-chat support by @Leymore in #286
[Feat] Add Otter to OpenCompass MMBench Evaluation by @Luodian in #232
[Docs] Update docs for new entry script by @gaotongxiao in #246
[Fix] Fix when missing both pad and eos token by @Leymore in #287
[Doc] Update MMBench.md by @kennymckormick in #336
[Feat] Support LLaVA and mPLUG-Owl by @ZhangYuanhan-AI in #331
[Feature] Ignore ZeroRetriever error when id_list provided by @Leymore in #340
[Enhance] Add import check of multimodal by @fangyixiao18 in #352
[Sync] [Enhancement] Simplify execution logic in run.py; use finally to clean up temp files by @gaotongxiao in #337
[Fix] Fix triviaqa & nq postprocess by @Leymore in #350
[Enhance] Supress warning raised by get_logger by @HAOCHENYE in #353
[Fix] Update qwen config by @Leymore in #358
[Fix] zero retriever add default value by @Leymore in #361
[Enhancement] Increase default task size by @gaotongxiao in #360
[Fix] Quick lint fix by @Leymore in #362
[Docs] update code evaluator docs by @yingfhu in #354
[Feat] support wizardcoder series by @yingfhu in #344
[Feat] Support Qwen-VL-Chat on MMBench. by @yyk-wew in #312
[Feature] Update claude2 postprocessor by @gaotongxiao in #365
[Doc] Update Overview by @tonysy in #242
[Feat] Update URL by @tonysy in #368
[Feature] Update llama2 implement by @Leymore in #372
[Feature] Add open source dataset eval config of instruct-blip by @fangyixiao18 in #370
[Fix] Update bbh implement & Fix bbh suffix by @Leymore in #371
[Feature] Add new models: baichuan2, tigerbot, vicuna v1.5 by @Leymore in #373
Bump version to 0.1.4 by @gaotongxiao in #367
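
One entry above (#337) mentions using finally to clean up temp files. The snippet below is a minimal sketch of that general pattern, not the actual run.py code: the function name `run_with_temp_config` and the `task` callback are hypothetical and serve only as illustration.

```python
import os
import tempfile


def run_with_temp_config(config_text, task):
    """Write config_text to a temp file, run task(path), and always clean up.

    Hypothetical helper for illustration; not part of OpenCompass.
    """
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        # Write the temporary config, then hand its path to the task.
        with os.fdopen(fd, "w") as f:
            f.write(config_text)
        return task(path)
    finally:
        # Runs whether task() returned normally or raised an exception.
        if os.path.exists(path):
            os.remove(path)
```

Because the removal sits in the finally clause, the temp file is deleted even when the task fails, so repeated failed runs do not leave stale files behind.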

For an exhaustive list of changes, kindly check our Full Changelog.

Contributors

tonysy, cdpath, and 10 other contributors


25 Aug 10:56

gaotongxiao

0.1.3

b2d602f

OpenCompass v0.1.3

OpenCompass keeps getting better! v0.1.3 brings a variety of enhancements, new features, and crucial fixes. Here’s a summary of what we've packed into this release:

🆕 Highlights:

Extended Dataset Support: OpenCompass now integrates a broader range of public datasets, including but not limited to adv_glue, codegeex2, Humanevalx, SEED-Bench, LongBench, and LEval. We aim to provide extensive coverage to cater to a variety of research needs.
Utility Additions: From the inclusion of multi-modal evaluations on MME benchmark to the Tree-of-Thought method, this release comes packed with functionality enhancements.
Bug Extermination: Your feedback helps us grow. We’ve squashed a series of bugs to improve your experience.
More Evaluation Benchmarks for Multimodal Models: We support another 10 evaluation benchmarks for multimodal models, including COCO Caption and ScienceQA, and provide the corresponding evaluation code.

Let's delve deeper into what's new:

🌟 New Features:

📦 Extended Dataset Support:

Introduction of other public datasets (#206, #214).
Support for adv_glue dataset focused on adversarial robustness (#205).
Added codegeex2, Humanevalx (#210).
Integration of SEED-Bench (#203).
LongBench support (#236).
Reconstructed LEval dataset (#266).
Support for another 10 public evaluation benchmarks for multimodal models (#214).

🛠 Utilities and Functionality:

Launch script added for ease of operations (#222).
Multi-modal evaluation on MME benchmark (#197).
Support for visualglm and llava on MMBench evaluation (#211).
Tree-of-Thought method introduced (#173).
Introduction of llama2 native implementations (#235).
Flamingo and Claude support added (#258, #253).

📝 Documentation:

Navigation bar language type updated for better clarity (#212).
News updates for keeping users informed (#241, #243).
Summarizer documentation added (#231).

🛠️ Bug Fixes:

Addressed an issue with multiple rounds of inference using mm_eval (#201).
Miscellaneous fixes such as name adjustments, requirements, and bin_trim corrections (#223, #229, #237).
Local runner debug issue fixed (#238).
Resolved bugs for PeftModel generate (#252).

⚙ Enhancements and Refactors:

Refactored instructblip for better performance and readability (#227).
Improved crowspairs postprocess (#251).
Optimization to use sympy only when necessary (#255).
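
The sympy item above describes a familiar optimization: defer a heavy import until the code path that needs it is reached. The snippet below is a minimal sketch of that idea, not the code from #255. The function name `is_answer_equal` and its comparison logic are hypothetical, and it assumes sympy may or may not be installed.

```python
import importlib


def is_answer_equal(a, b):
    """Compare two answer strings, importing sympy only if cheap checks fail.

    Hypothetical helper for illustration; not part of OpenCompass.
    """
    # Cheap path 1: exact string match after trimming whitespace.
    if a.strip() == b.strip():
        return True
    # Cheap path 2: plain numeric comparison.
    try:
        return float(a) == float(b)
    except ValueError:
        pass
    # Expensive path: import sympy lazily, only when it is really needed.
    try:
        sympy = importlib.import_module("sympy")
    except ImportError:
        return False  # sympy not installed: treat the answers as different
    try:
        # Note: sympify evaluates its input, so use it only on trusted strings.
        return sympy.simplify(sympy.sympify(a) - sympy.sympify(b)) == 0
    except (sympy.SympifyError, TypeError, AttributeError):
        return False
```

With this layout, inputs such as "1.0" and "1" are settled by the float comparison, and the sympy import cost is paid only for answers that need symbolic comparison.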

🎉 New Contributors:

Thank you to all our contributors for this release, with a special shoutout to our new contributors:

@yyk-wew (First PR)
@fangyixiao18 (First PR)
@philipwangOvO (First PR)
@cdpath (First PR)

Thank you to our dedicated contributors for making OpenCompass even more comprehensive and user-friendly! 🙌 🎉

Remember to star 🌟 our GitHub repository if you find OpenCompass helpful! Your feedback and contributions are invaluable.

Changelog

[Fix] Fix bugs of multiple rounds of inference when using mm_eval by @yyk-wew in #201
[Feature]: Add other public datasets by @YuanLiuuuuuu in #206
[Doc] Update Navigation bar language type by @Ezra-Yu in #212
[Feat] support adv_glue dataset for adversarial robustness by @yingfhu in #205
[Feat] Add codegeex2 and Humanevalx by @Ezra-Yu in #210
[Feature]: Add other public datasets config by @YuanLiuuuuuu in #214
[Feature] Support SEED-Bench by @fangyixiao18 in #203
[Feature]: Add launch script by @YuanLiuuuuuu in #222
[Fix]: Fix name by @YuanLiuuuuuu in #223
[Fix] requirements by @gaotongxiao in #229
[Dataset] LongBench by @philipwangOvO in #236
[Fix] bin_trim by @philipwangOvO in #237
[Feat] Support multi-modal evaluation on MME benchmark. by @yyk-wew in #197
[Feat] Support visualglm and llava for MMBench evaluation. by @yyk-wew in #211
[Fix] fix local runner debug by @Leymore in #238
Update News by @tonysy in #241
[Doc]update news by @tonysy in #243
Update run.py by @liushz in #247
[Doc] Add summarizer doc by @Leymore in #231
[Feature] Add llama2 native implements by @Leymore in #235
[Feature] Add Tree-of-Thought method by @liushz in #173
[Refactor] Refactor instructblip by @fangyixiao18 in #227
[Enhancement] Update crowspairs postprocess by @gaotongxiao in #251
[Fix] use sympy only when necessary by @gaotongxiao in #255
Update .owners.yml by @tonysy in #261
[Fix] Fix bugs for PeftModel generate by @LZHgrla in #252
[Feature]: Add Flamingo by @YuanLiuuuuuu in #258
[Feature] Add Claude support by @gaotongxiao in #253
[Dataset] Reconstruct LEval by @philipwangOvO in #266
[Feature]: Verify the acc of these public datasets by @YuanLiuuuuuu in #269
[Feat] Support public dataset of visualglm and llava. by @yyk-wew in #265
[Fix] wrong path in dataset collections by @gaotongxiao in #272
[Fix] update descriptions of tools by @cdpath in #270
[Feature] Support model-bound prediction postprocessor, use it in Claude by @gaotongxiao in #268
[Feature] Simplify entry script by @gaotongxiao in #204
Update README.md by @tonysy in #262

For a complete list of changes, please refer to our Full Changelog.

Contributors

tonysy, cdpath, and 10 other contributors

