作者
Roberto Henriques, Adria Ferreira, Mauro Castelli
发表日期
2022
期刊
Journal of Data and Information Science
卷号
7
期号
3
页码范围
49-70
简介
Purpose: Patent classification is one of the areas in Intellectual Property Analytics (IPA), and a growing use case since the number of patent applications has been increasing worldwide. We propose using machine learning algorithms to classify Portuguese patents and evaluate the performance of transfer learning methodologies to solve this task.
Design/methodology/approach: We applied three different approaches in this paper. First, we used a dataset available by INPI to explore traditional machine learning algorithms and ensemble methods. After preprocessing data by applying TF-IDF, FastText and Doc2Vec, the models were evaluated by cross-validation in 5 folds. In a second approach, we used two different Neural Networks architectures, a Convolutional Neural Network (CNN) and a bi-directional Long Short-Term Memory (BiLSTM). Finally, we used pre-trained BERT, DistilBERT, and ULMFiT models in the third approach.
Findings: BERTTimbau, a BERT architecture model pre-trained on a large Portuguese corpus, presented the best results for the task, even though with a performance of only 4% superior to a LinearSVC model using TF-IDF feature engineering.
Research limitations: The dataset was highly imbalanced, as usual in patent applications, so the classes with the lowest samples were expected to present the worst performance. That result happened in some cases, especially in classes with less than 60 training samples.
Practical implications: Patent classification is challenging because of the hierarchical classification system, the context overlap, and the underrepresentation of the classes. However, the final model presented an …
引用总数
学术搜索中的文章
R Henriques, A Ferreira, M Castelli - Journal of Data and Information Science, 2022