UltraFeedback: Boosting language models with high-quality feedback
Reinforcement learning from human feedback (RLHF) has become a pivot technique in
aligning large language models (LLMs) with human preferences. In RLHF practice …
OctoPack: Instruction tuning code large language models
Finetuning large language models (LLMs) on instructions leads to vast performance
improvements on natural language tasks. We apply instruction tuning using code …
Large language models for code analysis: Do LLMs really do their job?
Large language models (LLMs) have demonstrated significant potential in the realm of
natural language understanding and programming code processing tasks. Their capacity to …
DocMath-Eval: Evaluating math reasoning capabilities of LLMs in understanding long and specialized documents
Recent LLMs have demonstrated remarkable performance in solving exam-like math word
problems. However, the degree to which these numerical reasoning skills are effective in …
KnowledgeFMath: A knowledge-intensive math reasoning dataset in finance domains
We introduce KnowledgeFMath, a novel benchmark designed to evaluate LLMs' capabilities
in solving knowledge-intensive math reasoning problems. Compared to prior works, this …
StudentEval: a benchmark of student-written prompts for large language models of code
Code LLMs are being rapidly deployed and there is evidence that they can make
professional programmers more productive. Current benchmarks for code generation …
Towards understanding the capability of large language models on code clone detection: a survey
Code cloning, the duplication of code fragments, is common in software development. While
some reuse aids productivity, excessive cloning hurts maintainability and introduces bugs …
DocMath-Eval: Evaluating numerical reasoning capabilities of LLMs in understanding long documents with tabular data
Recent LLMs have demonstrated remarkable performance in solving exam-like math word
problems. However, the degree to which these numerical reasoning skills are effective in …
Coffee: Boost your code LLMs by fixing bugs with feedback
Code editing is an essential step towards reliable program synthesis to automatically correct
critical errors generated from code LLMs. Recent studies have demonstrated that closed …
Text2Analysis: A benchmark of table question answering with advanced data analysis and unclear queries
Tabular data analysis is crucial in various fields, and large language models show promise
in this area. However, current research mostly focuses on rudimentary tasks like Text2SQL …