StarCoder 2 and The Stack v2: The Next Generation

A Lozhkov, R Li, LB Allal, F Cassano… - arXiv preprint arXiv …, 2024 - arxiv.org
The BigCode project, an open-scientific collaboration focused on the responsible
development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In …

Robustness, security, privacy, explainability, efficiency, and usability of large language models for code

Z Yang, Z Sun, TZ Yue, P Devanbu, D Lo - arXiv preprint arXiv:2403.07506, 2024 - arxiv.org
Large language models for code (LLM4Code), which demonstrate strong performance (e.g.,
high accuracy) in processing source code, have significantly transformed software …

Exploring ChatGPT's Capabilities on Vulnerability Management

P Liu, J Liu, L Fu, K Lu, Y Xia, X Zhang… - 33rd USENIX Security …, 2024 - usenix.org
Recently, ChatGPT has attracted great attention from the code analysis domain. Prior works
show that ChatGPT has the capabilities of processing foundational code analysis tasks …

Code Membership Inference for Detecting Unauthorized Data Use in Code Pre-trained Language Models

S Zhang, H Li - arXiv preprint arXiv:2312.07200, 2023 - arxiv.org
Code pre-trained language models (CPLMs) have received great attention since they can
benefit various tasks that facilitate software development and maintenance. However …

Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code

V Majdinasab, A Nikanjam, F Khomh - arXiv preprint arXiv:2402.09299, 2024 - arxiv.org
Code auditing ensures that the developed code adheres to standards, regulations, and
copyright protection by verifying that it does not contain code from protected sources. The …

Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach

Y Wan, G Wan, S Zhang, H Zhang, P Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed significant progress in developing deep learning-based
models for automated code completion. Although using source code in GitHub has been a …

An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets

J Katzy, R Popescu, A Van Deursen… - … of the 2024 IEEE/ACM First …, 2024 - dl.acm.org
Does the training of large language models potentially infringe upon code licenses?
Furthermore, are there any datasets available that can be safely used for training these …