Web page classification: Features and algorithms

X Qi, BD Davison - ACM computing surveys (CSUR), 2009 - dl.acm.org
Classification of Web page content is essential to many tasks in Web information retrieval
such as maintaining Web directories and focused crawling. The uncontrolled nature of Web …

[BOOK][B] The text mining handbook: advanced approaches in analyzing unstructured data

R Feldman, J Sanger - 2007 - books.google.com
Text mining is a new and exciting area of computer science research that tries to solve the
crisis of information overload by combining techniques from data mining, machine learning …

Link prediction in relational data

B Taskar, MF Wong, P Abbeel… - Advances in neural …, 2003 - proceedings.neurips.cc
Many real-world domains are relational in nature, consisting of a set of objects related to
each other in complex ways. This paper focuses on predicting the existence and the type of …

A study of thresholding strategies for text categorization

Y Yang - Proceedings of the 24th annual international ACM …, 2001 - dl.acm.org
Thresholding strategies in automated text categorization are an underexplored area of
research. This paper presents an examination of the effect of thresholding strategies on the …

[PDF][PDF] Distributional word clusters vs. words for text categorization

R Bekkerman, R El-Yaniv, N Tishby, Y Winter - Journal of Machine …, 2003 - jmlr.org
We study an approach to text categorization that combines distributional clustering of words
and a Support Vector Machine (SVM) classifier. This word-cluster representation is …

A study of approaches to hypertext categorization

Y Yang, S Slattery, R Ghani - Journal of Intelligent Information Systems, 2002 - Springer
Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags,
category labels distributed over linked documents, and meta data extracted from related …

[PDF][PDF] Learning probabilistic models of link structure

L Getoor, N Friedman, D Koller, B Taskar - Journal of Machine Learning …, 2002 - jmlr.org
Most real-world data is heterogeneous and richly interconnected. Examples include the
Web, hypertext, bibliometric data and social networks. In contrast, most statistical learning …

Assam: A tool for semi-automatically annotating semantic web services

A Heß, E Johnston, N Kushmerick - The Semantic Web–ISWC 2004: Third …, 2004 - Springer
The semantic Web Services vision requires that each service be annotated with
semantic metadata. Manually creating such metadata is tedious and error-prone, and many …

Discovering missing links in Wikipedia

SF Adafre, M de Rijke - Proceedings of the 3rd international workshop …, 2005 - dl.acm.org
In this paper we address the problem of discovering missing hypertext links in Wikipedia.
The method we propose consists of two steps: first, we compute a cluster of highly similar …

Automated subject classification of textual web documents

K Golub - Journal of documentation, 2006 - emerald.com
Purpose: To provide an integrated perspective to similarities and differences between
approaches to automated classification in different research communities (machine learning …