Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium

S Evert, A Hardie - Proceedings of the Corpus Linguistics 2011 …, 2011 - Citeseer

Proceedings of the Corpus Linguistics 2011 conference, 2011 • Citeseer

Abstract

Corpus Workbench (CWB) is a widely used architecture for corpus analysis, originally designed at the IMS, University of Stuttgart (Christ 1994). It consists of a set of tools for indexing, managing and querying very large corpora with multiple layers of word-level annotation. CWB’s central component is the Corpus Query Processor (CQP), an extremely powerful and efficient concordance system implementing a flexible two-level search language that allows complex query patterns to be specified both at the level of an individual word or annotation, and at the level of a fully or partially specified pattern of tokens. CWB and CQP are commonly used as the back-end for web-based corpus interfaces, for example, in the popular BNCweb interface to the British National Corpus (Hoffmann et al. 2008). CWB has influenced other tools, such as the Manatee software used in SketchEngine, which implements the same query language (Kilgarriff et al. 2004).
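The two-level idea can be illustrated with a minimal Python sketch. This is not CWB or CQP code: the attribute names (`word`, `pos`, `lemma`), the toy corpus, and the helper functions `token_matches` and `find_sequences` are all assumptions made for illustration. The pattern used here is roughly comparable in spirit to a CQP query such as `[pos="ADJ"] [lemma="hous.*" & pos="NOUN"]`, and the sketch only handles fixed-length token sequences, with no quantifiers or optional elements.

```python
import re

# Toy corpus: each token carries several word-level annotation layers.
# The attribute names are hypothetical, chosen only for this example.
corpus = [
    {"word": "The", "pos": "DET", "lemma": "the"},
    {"word": "old", "pos": "ADJ", "lemma": "old"},
    {"word": "houses", "pos": "NOUN", "lemma": "house"},
    {"word": "were", "pos": "VERB", "lemma": "be"},
    {"word": "sold", "pos": "VERB", "lemma": "sell"},
]


def token_matches(token, constraints):
    """Level 1: each regex must match the whole value of its attribute."""
    return all(
        re.fullmatch(regex, token[attr]) is not None
        for attr, regex in constraints.items()
    )


def find_sequences(tokens, pattern):
    """Level 2: start positions where consecutive tokens satisfy the pattern."""
    n = len(pattern)
    return [
        start
        for start in range(len(tokens) - n + 1)
        if all(token_matches(tokens[start + j], pattern[j]) for j in range(n))
    ]


# An adjective followed by a noun whose lemma starts with "hous".
pattern = [{"pos": "ADJ"}, {"lemma": "hous.*", "pos": "NOUN"}]
hits = find_sequences(corpus, pattern)
```

Here `hits` holds the start position of each match, so a concordance line could be built by slicing the token list around that position. An empty constraint dictionary matches any token, which gives a simple wildcard slot.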

This paper details recent work to update CWB for the new century. Perhaps the most significant development is that CWB version 3 is now an open source project, licensed under the GNU General Public Licence. This change has substantially enlarged the community of developers and users and has enabled us to leverage existing open-source libraries in extending CWB’s capabilities. As a result, several key improvements were made to the CWB core: (i) support for multiple character sets, most especially Unicode (in the form of UTF-8), allowing all the world’s writing systems to be utilised within a CWB-indexed corpus; (ii) support for powerful Perl-style regular expressions in CQP queries, based on the open-source PCRE library; (iii) support for a wider range of OS platforms including Mac OS X, Linux, and Windows; and (iv) support for larger corpus sizes of up to 2 billion words on 64-bit platforms.
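To see why character-set awareness matters for regular expression matching, consider the following small Python sketch. It uses Python's built-in `re` module purely as a stand-in, not the PCRE library that CWB relies on, and the example word is an arbitrary choice. It shows that a pattern counting characters behaves differently on decoded text than on raw UTF-8 bytes.

```python
import re

word = "Straße"                 # six characters; "ß" is outside ASCII
encoded = word.encode("utf-8")  # "ß" occupies two bytes in UTF-8

char_count = len(word)
byte_count = len(encoded)

# "Exactly six characters" holds for the decoded string ...
matches_text = re.fullmatch(r".{6}", word) is not None
# ... but not for the raw bytes, where "." consumes one byte at a time.
matches_bytes = re.fullmatch(rb".{6}", encoded) is not None
```

A matcher that is unaware of the encoding would therefore miscount multi-byte characters, which is the kind of problem that explicit UTF-8 support is meant to avoid.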

