作者
Yasamin Salimi, Tim Adams, Mehmet Can Ay, Helena Balabin, Marc Jacobs, Martin Hofmann-Apitius
发表日期
2024/4/1
简介
Data Harmonization is an important yet time-consuming process. With the recent popularity of applications using Large Language Models (LLMs) due to their high capabilities in text understanding, we investigated whether LLMs could facilitate data harmonization for clinical use cases. To evaluate this, we created PASSIONATE, a novel Parkinson's disease (PD) Common Data Model (CDM) as a ground truth source for pairwise cohort harmonization using LLMs. Additionally, we extended our investigation using an existing Alzheimer’s disease (AD) CDM. We computed text embeddings based on two LLMs to perform automated cohort harmonization for both AD and PD. We additionally compared the results to a baseline method using fuzzy string matching to determine the degree to which the semantic understanding of LLMs can improve our harmonization results. We found that mappings based on text embeddings …