Analyzing of the evolution of web pages by using a domain based web crawler

E Uzun, T Yerlikaya, M Kurt - Engineering, Technologies and …, 2011 - erdincuzun.com
Engineering, Technologies and Systems-Techsys, 2011
Abstract
To improve the algorithms used in search engines, crawlers and indexers, the evolution of web pages should be examined. For this purpose, we developed a domain-based crawler, SET Crawler, which collects the web archives from 1998 to 2008 of three popular Turkish daily newspapers (Hurriyet, Milliyet and Sabah). After the crawl completed, we obtained a set of 3,430,997 HTML pages. The average file size of a web page grew from approximately 5.19 KB in 1998 to 53.94 KB in 2008. Given that the size of the main content of web pages has remained similar, this observation shows how much the use of unnecessary content and tags has increased. Analyses indicate that the use of link, image and layout tags has increased significantly over the last decade. Moreover, the <div> tag has come to be used instead of the <table> tag, especially in Milliyet and Sabah.
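The per-page measurements mentioned in the abstract (file size and counts of link, image and layout tags) could be gathered along the following lines. This is only a minimal sketch using Python's standard `html.parser` module, not the SET Crawler implementation; the names `TagCounter` and `page_stats` are hypothetical, and the size is computed from a UTF-8 encoding of the string, which may differ from the on-disk size of the original archived files.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts the start tags seen while parsing one HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.counts[tag] += 1


def page_stats(html: str) -> dict:
    """Return size and selected tag counts for one page (hypothetical helper)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {
        # Assumption: size of the UTF-8 encoded text, in kilobytes.
        "size_kb": len(html.encode("utf-8")) / 1024,
        "links": parser.counts["a"],
        "images": parser.counts["img"],
        "div": parser.counts["div"],
        "table": parser.counts["table"],
    }
```

Averaging `size_kb` and comparing the `div` and `table` counts per newspaper and per year would give figures of the kind the abstract reports.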