RISS Search - Dissertation Details

Building Topic-Specific Search Engines : A Data Mining Approach

Author

이정희
Publication

Los Angeles : University of California, 2000
Degree information

Thesis (doctoral) -- University of California: Computer Science, 2000
Year of publication

2000
Language

English
Subject terms
KDC

569 (4th edition)
Country (city) of publication

California
Physical description

xix, 153 p. : charts ; 26 cm
General note

References: p. 143-153

Multilingual Abstract

Topic-specific search engines are becoming popular with the phenomenal growth of the World Wide Web. They have a higher accuracy rate than general-purpose search engines and offer functions that general-purpose engines cannot provide. However, the topic-specific search engines available today are not cost-efficient, because they require intensive human labor, and thus enormous cost, to maintain as well as to build. Efficient processing of the exploding information on the World Wide Web seems to call for smarter search engines: topic-specific search engines that require far less human labor while performing almost as well as those built and maintained by humans. This dissertation is a contribution towards meeting this demand. Building and maintaining topic-specific search engines with minimal human labor requires an automatic or semi-automatic information gathering system, the outputs of which can be fed to the search engines. In the dissertation, I discuss techniques for four major components of the requisite information gathering system:
(1) Domain information extraction
(2) Topic expansion
(3) Topic-driven information gathering
(4) Text-classification system for web documents
I also discuss the performance of the prototype system, a search engine for XML, that I built to test the techniques.

Table of Contents

TABLE OF CONTENTS = iv
ABSTRACT OF THE DISSERTATION = xviii
CHAPTER 1 Introduction = 1
1.1 Web search engines = 2
1.1.1 General purpose search engines = 3
1.1.2 Internet directories and portals = 4
1.2 Challenges in building topic-specific information gathering system = 5
1.2.1 Target topic expansion = 6
1.2.2 Topic-driven crawling = 8
1.2.3 Determination of relevance = 9
1.3 Testbed system = 11
1.4 Brief survey of related works = 12
1.5 The contribution and structure of the dissertation = 15
CHAPTER 2 Preliminaries: The structure of WWW = 20
2.1 The graph structure and connectivity of the Web = 20
2.2 The link structures of the Web = 21
2.3 The Web communities = 22
2.4 Regularities in hyperlinking and web surfing = 23
CHAPTER 3 Web Page Metadata = 26
3.1 Introduction = 26
3.2 Hyperlink metadata = 26
3.3 HTML metadata = 28
3.4 Extracting metadata = 31
3.5 Metadata analysis = 31
3.6 Conclusion = 33
CHAPTER 4 Domain Information Extraction = 34
4.1 Introduction = 34
4.2 Duality based information extraction = 35
4.2.1 Information extraction from the WWW = 36
4.2.2 Previous work on duality problems on the Web = 37
4.3 Duality of relations and patterns on the Web = 38
4.4 Higher level duality problems = 40
4.5 Solving a 2-level duality problem: Mining for acronyms = 40
4.5.1 Problem description = 41
4.5.2 Acronym formation rules = 42
4.5.3 Patterns = 43
4.5.4 The mining algorithm = 45
4.5.5 Experiments on pattern learning = 47
4.6 Conclusion = 51
CHAPTER 5 Topic Expansion = 53
5.1 Introduction = 53
5.2 Topic expansion algorithm = 54
5.3 Discovering domain terms = 55
5.4 Discovering candidate topic terms = 56
5.4.1 Relevance metrics applied to association rules = 58
5.4.2 Candidate term mining algorithm = 59
5.5 Discovering relevant topic terms = 60
5.5.1 Filtering by stop words = 60
5.5.2 Filtering by sampling = 61
5.5.3 Filtering by specialization = 62
5.5.4 Sampling vs. specialization = 64
5.6 Experiments = 64
5.6.1 Candidate topic mining = 66
5.6.2 Filtering for relevant topics = 66
5.7 Conclusion = 70
CHAPTER 6 Topic-Driven Crawling = 71
6.1 Introduction = 71
6.2 Background: Web crawling = 72
6.2.1 Typical web crawling = 72
6.2.2 Topic-directed crawling = 73
6.3 The metadata database = 74
6.4 Topic-directed crawling algorithms = 75
6.4.1 Hypertext Induced Topic Search (HITS) = 76
6.4.2 Simple Heuristics (SH) = 78
6.4.3 Relevance Weighting (RW) = 79
6.4.4 Relevance Weighting with Boosting (RWB) = 80
6.4.5 HITS with RW (HITS-RW) = 81
6.5 Performance experiments = 82
6.5.1 Performance of RW = 83
6.5.2 Performance of RWB = 83
6.5.3 Performance of HITS = 86
6.5.4 Performance of HITS-RW = 86
6.5.5 Performance of SH = 87
6.6 Conclusion = 87
CHAPTER 7 Semi-structured Document Classification = 89
7.1 Introduction = 89
7.2 New challenges in classifying documents on the Web = 90
7.2.1 Limitations of conventional classifiers = 90
7.2.2 New approach of classifying web documents = 91
7.3 Background: overview of text classification = 92
7.3.1 Vector space model = 92
7.3.2 Bernoulli document generation model = 93
7.4 Semi-structured document classification = 93
7.4.1 Structured vector model = 94
7.4.2 Document generation model = 98
7.4.3 Path expression by tag augmentation = 99
7.4.4 The semi-structured document classifier = 100
7.5 Experiments = 101
7.5.1 Datasets = 101
7.5.2 Experiments on Patent dataset = 102
7.5.3 Experiments on Résumé dataset = 103
7.6 Conclusion = 104
CHAPTER 8 Evaluation of Topic-Driven Information Gathering = 105
8.1 Introduction = 105
8.2 Metrics: sample precision and sample coverage = 105
8.3 The sample set = 106
8.4 Experiments = 107
8.4.1 Sample set statistics = 107
8.4.2 Sample coverage and sample precision of the topic-driven crawl = 108
8.4.3 Other quality measures of topic-driven crawl = 111
8.5 Summary = 112
CHAPTER 9 Related Work = 115
9.1 Web topology = 115
9.1.1 The power law distribution = 115
9.1.2 Applying graph theoretic methods to the web = 117
9.2 Search engines = 119
9.2.1 Popularity based search engines = 119
9.2.2 Internet directories = 121
9.2.3 Personalized information systems = 122
9.3 Metadata = 124
9.4 Information extraction and duality mining = 125
9.5 Topic-specific information gathering = 127
9.5.1 HITS - Kleinberg's algorithm = 127
9.5.2 PageRank = 129
9.5.3 MetaCrawler = 129
9.5.4 Focused Crawler = 129
9.5.5 Cora = 130
9.5.6 Citeseer = 131
9.6 Association = 131
9.7 Classification = 132
9.7.1 Conventional classification = 133
9.7.2 Classifiers for non-conventional documents = 134
9.8 Semi-structured documents (management systems) = 135
CHAPTER 10 Conclusion = 136
10.1 Research contributions = 136
10.2 Future work = 138
APPENDIX A Metadata for HTML hyperlinks = 139
A.1 Anchor () tag = 139
A.1.1 Attributes = 139
A.1.2 Event handler attributes = 139
A.2 tag = 140
A.2.1 Attributes = 140
A.2.2 Event handler attributes = 140
A.3 tag = 140
A.3.1 Attributes = 140
A.4