SciREX: A challenge dataset for document-level information extraction

S Jain, M Van Zuylen, H Hajishirzi, I Beltagy - arXiv preprint arXiv …, 2020 - arxiv.org
arXiv preprint arXiv:2005.00512, 2020
Extracting information from full documents is an important problem in many domains, but most previous work focuses on identifying relationships within a sentence or a paragraph. It is challenging to create a large-scale information extraction (IE) dataset at the document level, since annotating entities and their document-level relationships, which usually span multiple sentences or even sections, requires an understanding of the whole document. In this paper, we introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks, including salient entity identification and document-level N-ary relation identification from scientific articles. We annotate our dataset by integrating automatic and human annotations, leveraging existing scientific knowledge resources. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE. Analyzing the model's performance shows a significant gap between human performance and current baselines, inviting the community to use our dataset as a challenge to develop document-level IE models. Our data and code are publicly available at https://github.com/allenai/SciREX