A survey of resource-efficient LLM and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion models, and LLM-based multimodal models, are revolutionizing the entire machine …

ChatGPT vs. Bard: a comparative study

I Ahmed, A Roy, M Kajol, U Hasan, PP Datta… - Authorea …, 2023 - authorea.com
The rapid progress in conversational AI has given rise to advanced language models
capable of generating human-like texts. Among these models, ChatGPT and Bard …

Towards General Industrial Intelligence: A Survey on IIoT-Enhanced Continual Large Models

J Chen, J He, F Chen, Z Lv, J Tang, W Li, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Currently, most applications in the Industrial Internet of Things (IIoT) still rely on CNN-based
neural networks. Although Transformer-based large models (LMs), including language …

Tokenizer Choice For LLM Training: Negligible or Crucial?

M Ali, M Fromm, K Thellmann, R Rutmann… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent success of LLMs has been predominantly driven by curating the training dataset
composition, scaling of model architectures and dataset sizes, and advancements in …
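
As a purely illustrative companion to this entry: "tokenizer choice" in such studies typically means which algorithm and vocabulary are trained on which data. Below is a minimal sketch of training a BPE tokenizer with the Hugging Face tokenizers library; the toy corpus, vocabulary size, and special tokens are our own placeholders, not settings from the paper.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Placeholder corpus; real studies train on large, curated data mixes.
corpus = [
    "Large language models depend on how text is segmented.",
    "Tokenizer choice changes vocabulary coverage and sequence length.",
]

# Byte-pair encoding (BPE), one of the algorithms such comparisons cover.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=1000,                      # assumed; tiny for the demo
    special_tokens=["[UNK]", "[PAD]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("Tokenizer choice matters.").tokens)
```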

Reproducibility, Replicability, and Insights into Dense Multi-Representation Retrieval Models: from ColBERT to Col⋆

X Wang, C Macdonald, N Tonellotto… - Proceedings of the 46th …, 2023 - dl.acm.org
Dense multi-representation retrieval models, exemplified by ColBERT, estimate the
relevance between a query and a document based on the similarity of their contextualised …
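
For context on the mechanism this entry describes: ColBERT scores a query-document pair by "late interaction", taking each query token embedding's maximum similarity over the document's token embeddings and summing those maxima. A minimal PyTorch sketch follows; the tensor shapes and the assumption of L2-normalised embeddings are ours.

```python
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> float:
    """ColBERT-style late interaction (MaxSim).

    q_emb: [num_query_tokens, dim], d_emb: [num_doc_tokens, dim];
    both assumed L2-normalised, so dot products are cosine similarities.
    """
    sim = q_emb @ d_emb.T                 # [num_query_tokens, num_doc_tokens]
    # For each query token, keep its best-matching document token, then sum.
    return sim.max(dim=1).values.sum().item()

# Toy usage with random normalised embeddings.
q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(q, d))
```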

Hints on the data for language modeling of synthetic languages with transformers

R Zevallos, N Bel - Proceedings of the 61st Annual Meeting of the …, 2023 - aclanthology.org
Language Models (LM) are becoming more and more useful for providing
representations upon which to train Natural Language Processing applications. However …

BioBERTurk: Exploring Turkish Biomedical Language Model Development Strategies in Low-Resource Setting

H Türkmen, O Dikenelli, C Eraslan, MC Çallı… - Journal of Healthcare …, 2023 - Springer
Pretrained language models augmented with in-domain corpora show impressive results in
biomedical and clinical Natural Language Processing (NLP) tasks in English. However …
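
The general recipe behind such work, continued masked-language-model pretraining on in-domain text, can be sketched in a few lines with the Hugging Face transformers library. Everything below is an assumption for illustration: the BERTurk checkpoint as base model, the two made-up Turkish sentences, and all hyperparameters; none of it is the paper's actual setup.

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Hypothetical in-domain sentences; the paper's clinical corpora are not reproduced here.
corpus = Dataset.from_dict({"text": [
    "Akciğer grafisinde belirgin patoloji saptanmadı.",
    "Hastanın klinik bulguları stabil seyretmektedir.",
]})

base = "dbmdz/bert-base-turkish-cased"   # BERTurk, a common Turkish base model
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

ds = corpus.map(lambda b: tok(b["text"], truncation=True, max_length=128),
                batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-demo",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()  # continued pretraining; downstream fine-tuning comes afterwards
```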

Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language

S Tamang, DJ Bora - arXiv preprint arXiv:2410.03718, 2024 - arxiv.org
The training of a tokenizer plays an important role in the performance of deep learning models.
This research aims to understand the performance of tokenizers in five state-of-the-art …
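
Evaluations of this kind commonly report fertility, the average number of subword tokens produced per whitespace-separated word (lower generally indicates a better fit to the language). A minimal sketch follows; the model list and the sample sentence are our own placeholders, since the five LLMs compared in the paper are truncated in the snippet above.

```python
from transformers import AutoTokenizer

# Hypothetical models; substitute the tokenizers actually under evaluation.
model_names = ["gpt2", "bert-base-multilingual-cased", "xlm-roberta-base"]

# Placeholder sentence; a real evaluation would use an Assamese corpus.
text = "এইটো এটা উদাহৰণ বাক্য।"
words = text.split()

for name in model_names:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    # Fertility: subword tokens per word for this tokenizer on this text.
    print(f"{name}: fertility = {n_tokens / len(words):.2f}")
```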

ARC-NLP at CheckThat!-2022: Contradiction for Harmful Tweet Detection.

C Toraman, O Ozcelik, F Sahinuç, U Sahin - CLEF (Working Notes), 2022 - ceur-ws.org
The target task of our team in the CLEF2022 CheckThat! Lab challenge is Task-1C, harmful
tweet detection. We propose a novel approach, called ARC-NLP-contra, which is a …

Harnessing the power of BERT in the Turkish clinical domain: pretraining approaches for limited data scenarios

H Türkmen, O Dikenelli, C Eraslan, MC Çallı… - arXiv preprint arXiv …, 2023 - arxiv.org
In recent years, major advancements in natural language processing (NLP) have been
driven by the emergence of large language models (LLMs), which have significantly …