The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data only

G Penedo, Q Malartic, D Hesslow… - Advances in …, 2023 - proceedings.neurips.cc
Large language models are commonly trained on a mixture of filtered web data and
curated "high-quality" corpora, such as social media conversations, books, or technical …

A survey of reasoning with foundation models

J Sun, C Zheng, E Xie, Z Liu, R Chu, J Qiu, J Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various
real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves …

Distilled GPT for source code summarization

CY Su, C McMillan - Automated Software Engineering, 2024 - Springer
A code summary is a brief natural language description of source code. Summaries are
usually only a single sentence long, and yet form the backbone of developer documentation …

Source code summarization in the era of large language models

W Sun, Y Miao, Y Li, H Zhang, C Fang, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
To support software developers in understanding and maintaining programs, various
automatic (source) code summarization techniques have been proposed to generate a …

A survey of neural code intelligence: Paradigms, advances and beyond

Q Sun, Z Chen, F Xu, K Cheng, C Ma, Z Yin… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural Code Intelligence -- leveraging deep learning to understand, generate, and optimize
code -- holds immense potential for transformative impacts on the whole society. Bridging the …

Concerned with Data Contamination? Assessing Countermeasures in Code Language Model

J Cao, W Zhang, SC Cheung - arXiv preprint arXiv:2403.16898, 2024 - arxiv.org
Various techniques have been proposed to leverage the capabilities of code language
models (CLMs) for SE tasks. While these techniques typically evaluate their effectiveness …

A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection

B Steenhoek, MM Rahman, MK Roy, MS Alam… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated great potential for code generation and
other software engineering tasks. Vulnerability detection is of crucial importance to …

CodeS: Natural Language to Code Repository via Multi-Layer Sketch

D Zan, A Yu, W Liu, D Chen, B Shen, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The impressive performance of large language models (LLMs) on code-related tasks has
shown the potential of fully automated software development. In light of this, we introduce a …

Text2analysis: A benchmark of table question answering with advanced data analysis and unclear queries

X He, M Zhou, X Xu, X Ma, R Ding, L Du… - Proceedings of the …, 2024 - ojs.aaai.org
Tabular data analysis is crucial in various fields, and large language models show promise
in this area. However, current research mostly focuses on rudimentary tasks like Text2SQL …

On the effectiveness of large language models for GitHub workflows

X Zhang, S Muralee, S Cherupattamoolayil… - Proceedings of the 19th …, 2024 - dl.acm.org
GitHub workflows or GitHub CI is a popular continuous integration platform that enables
developers to automate various software engineering tasks by specifying them as workflows …