UltraFeedback: Boosting language models with high-quality feedback

G Cui, L Yuan, N Ding, G Yao, W Zhu, Y Ni, G Xie, Z Liu… - 2023 - openreview.net
Reinforcement learning from human feedback (RLHF) has become a pivotal technique in
aligning large language models (LLMs) with human preferences. In RLHF practice …

OctoPack: Instruction tuning code large language models

N Muennighoff, Q Liu, A Zebaze, Q Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org
Finetuning large language models (LLMs) on instructions leads to vast performance
improvements on natural language tasks. We apply instruction tuning using code …

Large language models for code analysis: Do LLMs really do their job?

C Fang, N Miao, S Srivastav, J Liu, R Zhang… - 33rd USENIX Security …, 2024 - usenix.org
Large language models (LLMs) have demonstrated significant potential in the realm of
natural language understanding and programming code processing tasks. Their capacity to …

DocMath-eval: Evaluating math reasoning capabilities of LLMs in understanding long and specialized documents

Y Zhao, Y Long, H Liu, R Kamoi, L Nan… - Proceedings of the …, 2024 - aclanthology.org
Recent LLMs have demonstrated remarkable performance in solving exam-like math word
problems. However, the degree to which these numerical reasoning skills are effective in …

KnowledgeFMath: A knowledge-intensive math reasoning dataset in finance domains

Y Zhao, H Liu, Y Long, R Zhang, C Zhao… - Proceedings of the …, 2024 - aclanthology.org
We introduce KnowledgeFMath, a novel benchmark designed to evaluate LLMs' capabilities
in solving knowledge-intensive math reasoning problems. Compared to prior works, this …

StudentEval: a benchmark of student-written prompts for large language models of code

HML Babe, S Nguyen, Y Zi, A Guha… - arXiv preprint arXiv …, 2023 - arxiv.org
Code LLMs are being rapidly deployed and there is evidence that they can make
professional programmers more productive. Current benchmarks for code generation …

Towards understanding the capability of large language models on code clone detection: a survey

S Dou, J Shan, H Jia, W Deng, Z Xi, W He, Y Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Code cloning, the duplication of code fragments, is common in software development. While
some reuse aids productivity, excessive cloning hurts maintainability and introduces bugs …

DocMath-eval: Evaluating numerical reasoning capabilities of LLMs in understanding long documents with tabular data

Y Zhao, Y Long, H Liu, L Nan, L Chen, R Kamoi… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent LLMs have demonstrated remarkable performance in solving exam-like math word
problems. However, the degree to which these numerical reasoning skills are effective in …

Coffee: Boost your code LLMs by fixing bugs with feedback

S Moon, H Chae, Y Song, T Kwon, D Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
Code editing is an essential step towards reliable program synthesis to automatically correct
critical errors generated from code LLMs. Recent studies have demonstrated that closed …

Text2Analysis: A benchmark of table question answering with advanced data analysis and unclear queries

X He, M Zhou, X Xu, X Ma, R Ding, L Du… - Proceedings of the …, 2024 - ojs.aaai.org
Tabular data analysis is crucial in various fields, and large language models show promise
in this area. However, current research mostly focuses on rudimentary tasks like Text2SQL …