Text Categorization based on Artificial Neural Networks (ANN)


      Abstract

      Li Chenghua

      Department of Information and Communication
Chonbuk National University

Text categorization is an important application of machine learning to the field of document information retrieval. This thesis describes two kinds of neural networks for text categorization: multi-output perceptron learning (MOPL) and the back propagation neural network (BPNN). BPNN has been widely used in classification and pattern recognition; however, it has some generally acknowledged defects, which usually arise from what we call morbidity neurons. In this thesis I propose a novel adaptive learning approach for text categorization using an improved back propagation neural network. This algorithm can overcome some shortcomings of the traditional back propagation neural network, such as slow training speed and the tendency to get stuck in local minima. We test the three methods on the standard Reuters-21578 collection and compare their training time and performance. The results show that the proposed algorithm achieves high categorization effectiveness as measured by precision, recall and F-measure.

Summary

Text categorization is an important application of machine learning in information retrieval. In this thesis, two neural network approaches, Multi-Output Perceptron Learning (MOPL) and the Back Propagation Neural Network (BPNN), are applied to text categorization.
BPNN is widely used for classification and pattern recognition, but it has several defects that stem from morbidity neurons. This thesis proposes a new learning method based on an improved back propagation neural network. The algorithm alleviates the slow training speed of the conventional BPNN and its tendency to fall into local minima. The three methods are tested on the Reuters-21578 collection, and their training time and performance are compared. Precision, recall and F-measure confirm the high performance of the proposed text categorization algorithm.

• Contents
• List of figures
• List of tables
• Abstract
• Summary in Korean
• 1. Introduction
• 1.1 Text Categorization
• 1.2 Neural Networks for Information Retrieval
• 1.3 Outline
• 2. Architecture, Applications and Approaches in Text Categorization
• 2.1 The Architecture of the Text Categorization
• 2.2 Applications of Text Categorization
• 2.2.1 Automatic indexing for Boolean information retrieval systems
• 2.2.2 Document organization
• 2.2.3 Document filtering
• 2.2.4 Word sense disambiguation
• 2.2.5 Yahoo!-style search space categorization
• 2.3 Some Well-known Approaches in Text Categorization
• 2.3.1 k Nearest Neighbor algorithm
• 2.3.2 Rocchio algorithm
• 2.3.3 Support vector machine algorithm
• 2.3.4 Decision tree classifiers
• 2.3.5 Neural networks
• 3. Basic ANN Algorithms and Improved BPNN Algorithm
• 3.1 Artificial Neural Networks
• 3.2 Neural Network Topologies
• 3.2.1 Feed-forward networks
• 3.2.2 Recurrent networks
• 3.3 Learning in Artificial Neural Networks
• 3.3.1 Supervised learning
• 3.3.2 Unsupervised learning
• 3.4 Theory of MOPL and BPNN Algorithms
• 3.4.1 Basic theory of MOPL algorithm
• 3.4.2 Basic theory of the BPNN algorithm
• 3.4.3 BPNN defect analysis and commonly used improved methods
• 3.4.4 MRBP algorithms
• 4. Text Representation and Feature Reduction
• 4.1 Text Representation
• 4.1.1 Word extraction
• 4.1.2 Stop words removal
• 4.1.3 Word stemming
• 4.1.4 Term weight
• 4.2 Dimensional Reductions
• 4.2.1 The DF method
• 4.2.2 The CF-DF method
• 4.2.3 The TFxIDF method
• 4.2.4 Principal component analysis
• 5. Performance Evaluation and Experimental Results
• 5.1 The Reuters Collection
• 5.2 Evaluation Measures
• 5.3 Experimental Results
• 5.3.1 Experimental design
• 5.3.2 Experimental results
• 6. Conclusions and Future Work
• References
• List of Figures
• 2.1. The architecture of text categorization
• 2.2. Classifier for the wheat category in the Construe system
• 2.3. A decision tree equivalent to the DNF rule of Figure 2.2
• 3.1. The architecture of the MOPL
• 3.2. Typical three layers BP network
• 5.1. An example of a document from the Reuters-21578 corpus
• 5.2. Mean absolute error reduction during training with the three methods
• List of Tables
• 5.1. Some statistics for the Reuters-21578 corpus
• 5.2. A list of the 90 Reuters classes (topics) in the ModApte split
• 5.3. Network size and parameters
• 5.4. Computation time of the networks
• 5.5. Performance of the MOPL algorithm
• 5.6. Comparison of the performance of three kinds of networks
      • Chapter 1
      • 1. Introduction
      • 1.1 Text Categorization
• With the current explosive growth of internet usage, quickly extracting the accurate information that people need is becoming harder and harder. The demand for fast and useful access to online data is increasing. Text categorization is an efficient technology for handling and organizing text data. This thesis is about the automated categorization of texts into categories of certain topics. The subject goes back at least to the 1960s [1]. Since the 1960s we have seen an immense growth in the production and availability of digital libraries, news sources and online documents. As a result, automated text categorization has witnessed an increased and renewed interest. A generally accepted definition of text categorization is:
• “Text Categorization (TC) is the task of deciding whether a piece of text belongs to any of a set of prescribed categories. It is a generic text processing task useful in indexing for later retrieval, as a stage in natural language processing systems, for content analysis, and in many other roles.” (Lewis)
• The core problem in automated text categorization is this: how can documents be assigned to a category with the highest possible chance of being correct, without assigning too many incorrect categories, and at acceptable computational cost? The machine learning paradigm [2] has emerged as one of the main approaches in this area. The machine learning approach to text categorization consists of a general inductive process which automatically builds a classifier based on the characteristics of each of the categories. These characteristics are learned from documents already classified and then applied to the documents to be classified. Advantages of this approach include domain independence and resource savings, since effort is focused on constructing the system rather than hand-crafting the classifier.
      • 1.2 Neural Networks for Information Retrieval
      • In recent years, more and more researchers have proposed the application of artificial neural networks for information retrieval tasks. Due to the diversified background among different researchers, there are considerable differences between the approaches taken. Even though there are wide variations between the different approaches, the neural network models proposed for information retrieval tasks can be broadly divided into two main classes.
• The first class of neural network models consists mostly of domain-independent models used in many different domains. These include the Adaline and networks trained by back-propagation. These models are not specifically designed for information retrieval tasks. During training, the weights of the network connections are initialized to random values, and the networks learn to perform the particular IR task by cycling through the set of training examples provided and adjusting the connection weights accordingly.
• The second class of neural networks, on the other hand, consists mostly of ad hoc approaches designed specifically for solving information retrieval problems, and usually differs considerably from domain-independent neural network models. In most of these networks, the network units correspond to typical information retrieval objects, such as documents, queries, and indexing terms. Connections exist between different units, with weights indicating how closely the objects are related to each other. The most important characteristic which distinguishes these neural networks from the first class of networks is that the connection weights are initialized based on domain knowledge gathered from more traditional IR techniques. We refer to this method of weight initialization based on pre-existing domain knowledge as pre-programming. According to these different approaches, we classify the two kinds of models as neural networks without pre-programming and neural networks with pre-programming.
• There has been some related work on neural network applications in IR. An early work on the application of neural networks to information retrieval was presented by Belew [3] in his design of the AIR system. In this work, Belew developed a three-layer neural network of authors, index terms, and documents. The system used relevance feedback from its users to change its representation of authors, index terms and documents shared by some group of users. The learning process created many new connections between documents and index terms and used a modified correlational learning rule. Rose & Belew [4] extended AIR to a hybrid connectionist and symbolic system called SCALIR, which used analogical reasoning to find relevant documents for legal research.
      • Wong, Cai & Yao [5] used a three layer feed-forward neural network to compute term associations based on an adaptive bilinear retrieval model. In this work, each document and query is represented by a node. The document vectors are input to the network. The nodes in the input layer represent the document terms that are connected to the document nodes. The nodes in the hidden layer represent query terms. These nodes are connected to the document nodes. The output layer is just one node. They showed that a reduced network with only 200 terms (instead of 1217) performs equivalently to one using all the terms of the collection.
• Lin, Soergel & Marchionini [6] used a Kohonen network for information retrieval. A Kohonen feature map, which produces a two-dimensional representation of N-dimensional features, was applied to construct a self-organizing visual representation of the input documents. The input to this network was the document vector, and the output was a set of 140 cells arranged in a 14x10 grid. After 2500 iterations, the system classified 140 documents onto a bi-dimensional map. This grid, or bi-dimensional map, was then used for information retrieval.
• MacLeod & Robertson [7] used a neural network algorithm for document clustering. The algorithm compared favorably with conventional hierarchical clustering algorithms. Chen & Lynch [8] used a blackboard architecture that supported browsing and automatic concept exploration, using a Hopfield neural network parallel relaxation method to facilitate the use of existing thesauri.
• In Chen & Ng [9], the performance of a branch-and-bound serial search algorithm was compared with that of parallel Hopfield network activation in a hybrid neural-semantic network (one neural network and two semantic networks). Both methods achieved similar performance, but the Hopfield activation method appeared to activate concepts from different networks more evenly. Lin & Chen [10] used a similar Hopfield neural network to perform concept clustering in bilingual (Chinese-English) documents. The concept space that the system generates can be used for categorization or retrieval.
• As this review shows, the earliest works tried to apply feed-forward algorithms and to represent the three basic elements of an information retrieval system (documents, queries and index terms) as individual layers in the neural network. The other big category of neural network applications involves performing more specific tasks such as conceptual clustering [9-10], document clustering [7] and concept mapping [6]. All of these methods have been tested on small collections of a few hundred documents.
      • 1.3 Outline
• The rest of this thesis is organized as follows. Chapter 2 introduces the architecture of, and approaches to, text categorization. Chapter 3 presents the basic ANN algorithms and then proposes our modified BPNN algorithm. Chapter 4 introduces text representation and dimensionality reduction. Chapter 5 discusses and analyzes the performance evaluation procedures and experimental results. Chapter 6 is the concluding chapter, in which we summarize the contributions of this research and some of the ways it can be extended in future work.
      • Chapter 2
      • 2. Architecture, Applications and Approaches in Text Categorization
      • 2.1 The Architecture of the Text Categorization
      • In this section, we give an overview of the architecture of text categorization and describe the functionality of each component in text categorization. Text categorization usually comprises three key components: data pre-processing, classifier construction, and document categorization. Data pre-processing implements the function of transferring initial documents into a compact representation and will be uniformly applied to training, validation, and classification phases. Classifier construction implements the function of inductive learning from a training dataset, and document categorization implements the function of document classification. All three components together make text categorization practicable.
      • Fig 2.1 The architecture of text categorization
      • In Fig. 2.1, the arrow with dashed line represents the data flow in the categorization process and the arrow with the solid line represents the data flow in the classifier construction process.
• Data input comprises training data and test data, and data pre-processing comprises six sub-components: document conversion, function word removal, word stemming, feature selection, dictionary construction, and feature weighting. The functionality of each component is described as follows:
• (1) Document conversion – converts different types of documents, such as XML, PDF, HTML and DOC formats, to plain text format.
• (2) Function word removal – removes topic-neutral words such as articles (a, an, the), prepositions (in, of, at), conjunctions (and, or, nor), etc. from the documents.
• (3) Word stemming – standardizes word suffixes (e.g., labeling → label, introduction → introduce).
• (4) Feature selection – reduces the dimensionality of the data space by removing irrelevant or less relevant features. In our prototype, we choose information gain as the feature selection criterion.
• (5) Dictionary construction – constructs a uniform dictionary, which is used as a reference for converting a text document to a vector of features. Each feature in the vector corresponds to a word in the dictionary.
• (6) Feature weighting – assigns different weights to words in the dictionary.
      • Each document will be converted into a compact representation and will be applied to training, validation, and classification phases.
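• As a rough illustration of how the six pre-processing components fit together, the following sketch converts a raw text into a term-frequency vector over a small dictionary. It is only a toy pipeline written for this description: the stop-word list, the crude suffix stripping (standing in for a real stemmer) and the example dictionary are illustrative, not the system's actual implementation.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "in", "of", "at", "and", "or", "nor"}   # (2) function words

def preprocess(text, dictionary):
    """Turn one plain-text document into a feature vector over `dictionary`."""
    tokens = re.findall(r"[a-z]+", text.lower())            # (1) word extraction from plain text
    tokens = [t for t in tokens if t not in STOP_WORDS]     # (2) function word removal
    stems = [t.rstrip("s") for t in tokens]                 # (3) crude stand-in for word stemming
    counts = Counter(s for s in stems if s in dictionary)   # (5) keep only dictionary features
    return [counts[term] for term in dictionary]            # (6) term-frequency feature weighting

dictionary = ["wheat", "grain", "export", "price"]           # toy uniform dictionary
print(preprocess("Wheat and grain exports raised wheat prices.", dictionary))   # [2, 1, 1, 1]
```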
      • 2.2 Applications of Text Categorization
      • Automatic TC has been used in a number of different applications. In the following, we briefly review the most important ones.
      • 2.2.1 Automatic indexing for Boolean information retrieval systems
• The first use to which automatic text classifiers were put, and the application that spawned most of the early research in the field, is that of automatic document indexing for use in information retrieval (IR) systems relying on a controlled dictionary. The most prominent example of such IR systems is, of course, that of Boolean systems. In these systems, each document is assigned one or more keywords or key phrases describing its content, where these keywords and key phrases belong to a finite set of words called a controlled dictionary, often consisting of a hierarchical thesaurus (e.g. the NASA thesaurus for the aerospace discipline, or the MESH thesaurus covering the medical field).
      • Usually, this assignment is performed by trained human indexers, and is thus an extremely costly activity.
• If the entries in the thesaurus are viewed as categories, document indexing becomes an instance of the document categorization task, and may thus be addressed by the automatic techniques described in this thesis. Note that in this case a typical constraint may be that k1 ≤ x ≤ k2 keywords are assigned to each document, for given k1, k2. Document-pivoted categorization might typically be the best option, so that new documents may be classified as they become available. Various automatic document classifiers explicitly aimed at document indexing applications have been described in the literature.
• The issue of automatic indexing with controlled dictionaries is closely related to the topic of automated metadata generation. In digital libraries we are usually interested in tagging documents with metadata that describe them under a variety of aspects (e.g. creation date, document type or format, availability, etc.). Usually, some of these metadata are thematic, i.e. their role is to describe the semantics of the document by means of bibliographic codes, keywords or key phrases. The generation of these metadata may thus be viewed as a problem of document indexing with a controlled dictionary, and thus tackled by means of automatic TC techniques.
      • An example system for automated metadata generation by TC techniques is the klarity system (http://www.topic.com.au/products/klarity.html).
      • 2.2.2 Document organization
• In general, all issues pertaining to document organization and filing, be it for purposes of personal organization or document repository structuring, may be addressed by automatic categorization techniques. For instance, at the offices of a newspaper, incoming “classified” ads must be categorized, prior to publication, under the categories used in the categorization scheme adopted by the newspaper; typical categories might be Personals, Cars for Sale, or Real Estate. While most newspapers would handle this application manually, those dealing with a high daily number of classified ads might prefer an automatic categorization system to choose the most suitable category for a given ad. In this case a typical constraint might be that exactly one category is assigned to each document. A first-come, first-served policy might look the aptest here, which would make one lean toward a document-pivoted categorization style. Similar applications might be the automatic filing of newspaper articles under the appropriate sections (e.g. Politics, Home News, Lifestyles, etc.), or the automatic grouping of conference papers into sessions.
      • Document organization, both in the cases of paper documents and electronic documents, often has the purpose of making document search easier. An interesting example of this approach is the system for classifying and searching patents of the U.S. Patent and Trademark Office. In this system documents describing patents are classified according to a hierarchical set of categories. Patent office personnel may thus search for existing patents related to a claimed new invention with greater ease.
      • 2.2.3 Document filtering
      • Document filtering refers to the activity of classifying a dynamic collection of documents, typically in the form of a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer [11]. A typical case of this is a news feed, whereby the information producer is a news agency (e.g. Reuters or Associated Press) and the information consumer is a newspaper. In this case, the filtering system should block the delivery to the consumer of the documents the consumer is not likely to be interested in (e.g. all news not concerning sports, in the case of a sports newspaper). Filtering can be seen as a case of single-label categorization, i.e. the categorization of incoming documents in two disjoint categories, the relevant and the irrelevant. Additionally, a filtering system may also perform a further categorization into topical categories of the documents deemed relevant to the consumer; in the example above, all articles about sports are deemed relevant, and should be further subcategorized according e.g. to which sport they deal with, so as to allow individual journalists specialized in individual sports to access only documents of high prospective interest for them. Similarly, an e-mail filter might be trained to further classify previously filtered e-mail into topical categories of interest to the user.
      • A document filtering system may be installed at the producer end, in which case its role is to route the information to the interested consumers only, or at the consumer end, in which case its role is to block the delivery of information deemed uninteresting to the user. In the former case the system has to build and update a “profile” for each consumer it serves, whereas in the latter case (which is the more common, and to which we will refer in the rest of this section) a single profile is needed.
      • A profile may be initially specified by the user, thereby resembling a standing IR query, and is usually updated by the system by using feedback information provided by the user on the relevance or non-relevance of the delivered messages. In the TREC community [12] this is called adaptive filtering, while the case in which no user-specified profile is available is called either routing or batch filtering, depending on whether documents have to be ranked in decreasing order of estimated relevance or just accepted/rejected.
      • In information science document filtering has a tradition dating back to the ’60s, when, addressed by systems of varying degrees of automation and dealing with the multi-consumer case discussed above, it was variously called selective dissemination of information or current awareness. The explosion in the availability of digital information, particularly on the Internet, has boosted the importance of such systems. These are nowadays being used in many different contexts, including the creation of personal Web newspapers, “junk e-mail” blocking, and the selection of Usenet news.
      • 2.2.4 Word sense disambiguation
      • Word sense disambiguation (WSD) refers to the activity of finding, given the occurrence in a text of an ambiguous word, the sense this particular word occurrence has. For instance, the English word bank may have (at least) two different senses, as in the Bank of England (a financial institution) or the bank of river Thames (a hydraulic engineering artifact). It is thus a WSD task to decide to which of the above senses the occurrence of bank in Last week I borrowed some money from the bank refers to. WSD is very important for a number of applications, including indexing documents by word senses rather than by words for IR or other content-based document management applications.
      • WSD may be seen as a categorization task once we view word occurrence contexts as documents and word senses as categories. Quite obviously, this is a case in which exactly one category needs to be assigned to each document, and one in which document-pivoted categorization is most likely to be the right choice. WSD is viewed as a TC task in a number of different works in the literature.
      • WSD is just an example of the more general issue of resolving natural language ambiguities, one of the most important problems in computational linguistics. Other instances of this problem, which may all be tackled by means of TC techniques along the lines discussed for WSD, are context-sensitive spelling correction, prepositional phrase attachment, part of speech tagging, and word choice selection in machine translation.
      • Fig. 2.2. Classifier for the wheat category in the Construe system; keywords are indicated in italic, and Wheat is the category.
      • 2.2.5 Yahoo!-style search space categorization
• Automatic text categorization has recently aroused a lot of interest also for its possible Internet applications. One of these is automatically classifying Web pages, or sites, into one or several of the categories that make up commercial hierarchical catalogues such as those embodied in YAHOO!, INFOSEEK, etc. When Web documents are catalogued in this way, rather than addressing a generic query to a general-purpose Web search engine, a searcher may find it easier to first navigate the hierarchy of categories and then issue her search from (i.e. restrict her search to) a particular category of interest.
• Automatically classifying Web pages has obvious advantages, since the manual categorization of a large enough subset of the Web is problematic, to say the least. Unlike in the previous applications, this is a case in which one might typically want each category to be populated by a set of k1 ≤ x ≤ k2 documents, and one in which category-centered categorization may be the aptest, so as to allow new categories to be added and obsolete ones to be deleted.
      • 2.3 Some Well-known Approaches in Text Categorization
      • 2.3.1 k Nearest Neighbor algorithm
• The k Nearest Neighbors algorithm (kNN) [13] is a similarity-based learning algorithm that has been shown to be very effective for a variety of problem domains, including text categorization. This approach has become a standard within the field of text categorization and is included in numerous experiments as a basis for comparison, the main reason being that it is among the better performing algorithms. kNN takes an arbitrary input document and ranks the k nearest neighbors among the training documents through the use of a similarity score. It then adopts the category of the most similar document or documents; k denotes the number of neighbors included in the evaluation. As our documents can have more than one category assigned to them, we will try a number of ways of selecting categories from kNN. Given a test document, the kNN algorithm finds the k nearest neighbors among the training documents, and uses the categories of the k neighbors to weight the category candidates:
• $y(\vec{x}, c_j) = \sum_{\vec{d}_i \in kNN(\vec{x})} \mathrm{sim}(\vec{x}, \vec{d}_i)\, y(\vec{d}_i, c_j)$   (1)
• where $y(\vec{d}_i, c_j)$ indicates whether document $\vec{d}_i$ belongs to class $c_j$ (y = 1 for YES, and y = 0 for NO), and $\mathrm{sim}(\vec{x}, \vec{d}_i)$ is the similarity between the test document $\vec{x}$ and the training document $\vec{d}_i$. The decision function assigns to $\vec{x}$ the category that accumulates the highest score among the k training examples nearest to $\vec{x}$.
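• The following sketch shows one way to implement the weighted vote of formula (1) with cosine similarity; the function and variable names are illustrative, and the experiments in Chapter 5 do not necessarily use this exact code.

```python
import numpy as np

def knn_categorize(x, train_vecs, train_labels, k=5):
    """Weighted kNN decision of formula (1): each of the k most similar training
    documents votes for its categories with a weight equal to its similarity."""
    sims = train_vecs @ x / (np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(x) + 1e-12)
    nearest = np.argsort(-sims)[:k]            # indices of the k nearest training documents
    scores = {}
    for i in nearest:
        for c in train_labels[i]:              # a training document may carry several categories
            scores[c] = scores.get(c, 0.0) + sims[i]
    return max(scores, key=scores.get)         # category with the highest accumulated score
```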
      • 2.3.2 Rocchio algorithm
• The Rocchio method is a linear classifier. This approach builds a centroid document for each category and then ranks the categories by a similarity score that compares each feature of the category's centroid document to the features of the test document. Given a training dataset Tr, it directly computes a classifier for category $c_j$ by means of the formula:
• $w_{kj} = \beta \cdot \frac{1}{|POS_j|} \sum_{d_i \in POS_j} w_{ki} \; - \; \gamma \cdot \frac{1}{|NEG_j|} \sum_{d_i \in NEG_j} w_{ki}$   (2)
• where $w_{ki}$ is the weight of the term $t_k$ in the document $d_i$, $POS_j$ and $NEG_j$ are the training documents for which $y(d_i, c_j) = T$ (belongs to category $c_j$) and $y(d_i, c_j) = F$ (does not belong to $c_j$) respectively, and $\beta$ and $\gamma$ are two control parameters used for setting the relative importance of positive and negative instances. The profile of $c_j$ is the centroid of its positive training examples. A classifier built by means of the Rocchio method rewards the closeness of a test document to the centroid of the positive training instances, and its distance from the centroid of the negative training instances [14].
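• A minimal sketch of formula (2): the profile of a category is a weighted difference between the centroid of its positive examples and the centroid of its negative ones. The parameter values and names below are illustrative only.

```python
import numpy as np

def rocchio_profile(train_vecs, in_category, beta=16.0, gamma=4.0):
    """Category profile of formula (2); `in_category` is a boolean mask over the rows."""
    pos_centroid = train_vecs[in_category].mean(axis=0)    # centroid of positive examples
    neg_centroid = train_vecs[~in_category].mean(axis=0)   # centroid of negative examples
    return beta * pos_centroid - gamma * neg_centroid

def rocchio_classify(x, profiles):
    """Assign the test vector to the category whose profile scores it highest."""
    return max(profiles, key=lambda c: float(x @ profiles[c]))
```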
      • 2.3.3 Support vector machine algorithm
• SVM classification algorithms, proposed by Vapnik [15] to solve two-class problems, are based on finding a separating hyper-plane between the classes of data. This means that the SVM algorithm can operate even on fairly large feature sets, since the goal is to measure the margin of separation of the data rather than matches on features. The SVM is trained using pre-classified documents. Research has shown [16] that SVM scales well and has good performance on large data sets. SVM was introduced into TC by Joachims [17] and subsequently used by many TC researchers. Let $Tr = \{(\vec{x}_i, y_i)\}_{i=1}^{n}$ be a set of n instances for training, where $\vec{x}_i$ is the feature vector of the i-th document and $y_i \in \{-1, +1\}$ is its category. SVM learns linear decision rules $h(\vec{x}) = \mathrm{sign}(\vec{w} \cdot \vec{x} + b)$, described by a weight vector $\vec{w}$ and a threshold $b$. If $Tr$ is linearly separable, SVM finds the hyper-plane with maximum Euclidean distance to the closest training instances. If $Tr$ is non-separable, the amount of training error is measured using slack variables $\xi_i$. Computing the hyper-plane is equivalent to solving the following optimization problem [18]:
• Minimize: $\frac{1}{2}\, \vec{w} \cdot \vec{w} + C \sum_{i=1}^{n} \xi_i$   (3)
• Subject to: $y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, n$   (4)
• $\xi_i \ge 0, \quad i = 1, \dots, n$   (5)
• The factor C in (3) is a parameter used for trading off training error against model complexity. The constraints (4) require that all training instances be classified correctly up to some slack $\xi_i$.
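• For illustration, the primal problem (3)-(5) can be attacked directly with sub-gradient descent on the hinge loss, as in the sketch below. This is not the solver used in the SVM literature cited above (which typically works on the dual); it only makes the roles of C, the slack and constraint (4) concrete.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Sub-gradient descent on 0.5*||w||^2 + C*sum(slack_i), where the slack of
    instance i is max(0, 1 - y_i*(w.x_i + b)); labels y_i must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                                    # instances violating constraint (4)
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b                                               # predict with sign(w.x + b)
```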
      • 2.3.4 Decision tree classifiers
• Probabilistic induction methods are essentially quantitative (i.e. numeric) in nature, and as such have sometimes been criticized because, effective as they may be, they are not readily interpretable by humans. Algorithms that do not suffer from this problem are symbolic (i.e. non-numeric) algorithms, among which inductive rule learners and decision tree inducers are the most important examples.
      • Fig.2.3. A decision tree equivalent to the DNF rule of Figure 2.2. Edges are labeled by terms and leaves are labeled by categories (underlining denotes negation).
• A decision tree text classifier consists of a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight that the term has in the representation of the test document, and leaf nodes are labeled by (not necessarily different) categories. Such a classifier categorizes a test document $d_j$ by recursively testing the weights that the terms labeling the internal nodes have in the representation of $d_j$, until a leaf node is reached; the label of this leaf node is then assigned to $d_j$. Most such text classifiers assume a binary document representation, and thus consist of binary trees. An example of such a tree is illustrated in Figure 2.3.
• A possible procedure for the induction of a decision tree for category $c_i$ from a set of training examples consists of a “divide and conquer” strategy of recursively: (i) checking whether all the training examples have the same label (either $c_i$ or $\bar{c}_i$); (ii) if not, selecting a term $t_k$, partitioning the training examples into classes of documents that have the same value for $t_k$, and placing each such class in a separate sub-tree. The process is repeated recursively on the sub-trees until each leaf node of the tree so generated contains training examples assigned to the same category $c_i$, which is then chosen as the label for the leaf node. The key step of this process is the choice of the term on which to operate the partition, a choice which is generally made according to an information gain or entropy criterion. However, such a “fully grown” tree may be prone to over-fitting, as some branches may be excessively specific to the training data. Any decision tree induction method thus includes a method for growing the tree and one for pruning it, i.e. for removing the overly specific branches so as to minimize the probability of misclassifying test documents. Variations on this basic schema for tree induction abound; the interested reader is referred to [19].
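• The recursive “divide and conquer” step described above can be sketched as follows. The split criterion here is simply “take the next unused term”, standing in for the information gain criterion; documents are modelled as sets of terms and labels as category names, so this is a toy illustration rather than a production inducer.

```python
def grow_tree(docs, labels, terms):
    """Toy decision-tree induction: emit a leaf when the node is pure,
    otherwise split on a term and recurse on the two partitions."""
    if len(set(labels)) == 1:
        return labels[0]                              # pure leaf
    if not terms:
        return max(set(labels), key=labels.count)     # majority leaf (no terms left)
    term, rest = terms[0], terms[1:]                  # naive choice instead of information gain
    with_t = [(d, l) for d, l in zip(docs, labels) if term in d]
    without = [(d, l) for d, l in zip(docs, labels) if term not in d]
    if not with_t or not without:
        return grow_tree(docs, labels, rest)          # term does not split the node; skip it
    return {term: {True: grow_tree([d for d, _ in with_t], [l for _, l in with_t], rest),
                   False: grow_tree([d for d, _ in without], [l for _, l in without], rest)}}
```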
      • 2.3.5 Neural networks
• A neural network classifier is a network of units, where the input units usually represent terms, the output units represent the category or categories of interest, and the weights on the edges that connect units represent conditional dependence relations. For classifying a test document $d_j$, its term weights are assigned to the input units; the activation of these units is propagated forward through the network, and the value that the output units take up as a consequence determines the categorization decision. A typical way of training neural networks is back-propagation, whereby the term weights of a training document are loaded into the input units, and if a misclassification occurs the error is “back-propagated” so as to change the parameters of the network and eliminate or minimize the error.
• The simplest type of neural network classifier is the perceptron [20]. Other types of linear neural network classifiers implementing a form of logistic regression have also been proposed and experimented with by Schutze [21] and Wiener [22], and they have usually given very good effectiveness results.
• A non-linear neural network [21-22] is instead a network with one or more additional “layers” of units, which in TC usually represent higher-order interactions between terms that the network is able to learn. When comparative experiments relating non-linear neural networks to their linear counterparts have been performed, the former have yielded either no improvement [21] or very small improvements [22] over the latter.
      • Chapter 3
      • 3. Basic ANN Algorithms and Improved BPNN Algorithm
      • 3.1 Artificial Neural Networks
      • An artificial neural network is an information-processing system that has certain performance characteristics in common with biological neural networks. Artificial neural networks have been developed as generalizations of mathematical models of human cognition or neural biology, based on the assumptions that:
      • 1 Information processing occurs at many simple elements called neurons.
      • 2 Signals are passed between neurons over connection links.
      • 3 Each connection link has an associated weight, which, in a typical neural net, multiplies the signal transmitted.
      • 4 Each neuron applies an activation function (usually nonlinear) to its net input (sum of weighted input signals) to determine its output signal.
      • A neural network is characterized by (1) its pattern of connections between the neuron (called its architecture), (2) its method of determining the weights on the connections (called its training, or learning, algorithm), and (3) its activation function.
• Earlier work in neural network research dates back to the 1940s, when McCulloch and Pitts described a simplified model of a neuron [23]. Over the years, there have been many proposals for different neural computation models. However, there are some basic elements common to most of these models. In the following sections, we will try to summarize these elements with the aim of giving the reader a better picture of what neural networks are from the computer science perspective.
      • 3.2 Neural Network Topologies
      • In many networks, the neurons are arranged in a number of layers. Neural networks having more than one layer of neurons are sometimes called multi-layer neural networks. Based on the way the neuron layers are connected together in a neural network, or the network topology, we can broadly classify the various neural networks into two classes: feed-forward networks and recurrent networks.
      • 3.2.1 Feed-forward networks
• In a multi-layer feed-forward neural network, the direction in which signals flow through the connections is strictly feed-forward, from one layer to a following layer. Although the signal flow can continue across multiple layers of neurons, there must not exist any feedback connections going from one layer to a preceding layer. Moreover, there should not be any intra-layer connections between the neurons of the same layer. Since loops are not possible in a feed-forward neural network, the number of signal propagation steps is bounded by the number of layers, or depth, of the network. After this finite number of steps of signal flow, all the activation values in the network will be known.
• In a feed-forward network, there is usually a layer called the input layer, which accepts external signals as input to the neural network. There is also one other layer, called the output layer, in which the activation values of the neurons are taken as the output of the neural network. The layers in between the input layer and the output layer are sometimes called hidden layers. Correspondingly, neurons in the input layer are called input units, while those in the output layer and a hidden layer are called output units and hidden units, respectively.
      • Some examples of feed-forward neural networks are Perceptrons, Adaline[24], and non-linear feed-forward networks trained by Back-propagation[25].
      • 3.2.2 Recurrent networks
      • As the name suggests, recurrent neural networks are networks that contain feed-back connections. There are many types of recurrent neural networks proposed in the literature, including the well-known model proposed by Hopfield[26], often called the Hopfield Network.
• Unlike feed-forward networks, recurrent networks contain loops. As a result, signal flow may continue for any number of steps. In this case, the network is said to undergo a relaxation process until the activation values converge to a stable state. In some cases, this stable state may not be reached and the network will continue to change its activation values without becoming stable. When the network does become stable, the set of activation values at the stable state represents the output of the network.
      • 3.3 Learning in Artificial Neural Networks
      • The ability to learn or adapt is often considered one of the most important properties that has attracted so much interest in neural networks. It can be defined as the ability of the module to adjust itself such that the self-adjustment will enable the module to perform a given task or a set of tasks with improved performance over time. Depending on whether an external teacher is present, we can have two different learning paradigms:
• 3.3.1 Supervised learning
• Supervised learning refers to the learning paradigm in which an external teacher is present. The role of this external teacher is to give feedback to the learning module about the appropriateness of its performance with respect to some given tasks and goals. In the case of neural network learning, this means that the network is given a set of input-output pairs as training examples. In order to learn, the neural network tries to adjust its connection weights according to some learning rule, such that when given any of the training inputs, the network will eventually give outputs that match the corresponding training outputs.
      • 3.3.2 Unsupervised learning
      • In unsupervised learning, there is no external teacher giving feedback to the network. In this paradigm, the neural network makes adjustments to itself based only on a set of input patterns given to it. In other words, a training example given to the network consists of only an input. One common application which often makes use of unsupervised learning in neural networks is in clustering a set of input patterns. In this case, the network learns to discover correlations and regularities in the inputs, and adjusts its response with respect to the inputs accordingly. Unsupervised learning is sometimes referred to as self-organization.
      • In the thesis, we are only interested in the supervised learning paradigm in neural networks, as it is the learning paradigm for the Perceptron and Back-propagation learning rule on which our text categorization model is based.
      • 3.4 Theory of MOPL and BPNN Algorithms
      • 3.4.1 Basic theory of MOPL algorithm
• The perceptron had perhaps the most far-reaching impact of any of the early neural networks. The perceptron learning rule is more powerful than the Hebb rule. A single neuron can be used to classify two categories, provided the patterns are linearly separable; when there are more than two categories, MOPL can be used. The architecture of the MOPL is shown in Fig. 3.1. The network has an input layer and an output layer. The weights from the input layer to the output layer are adjusted by the perceptron learning rule. For each training input, the net calculates the response of the output neurons and then determines whether an error occurred for this pattern (by comparing the calculated output with the target value). The activation of each output neuron can be computed as:
• $y_j = f\big(\sum_i w_{ij}\, x_i + b_j\big)$   (6)
• where $b_j$ is the bias and $w_{ij}$ is the weight from the input layer to the output layer. The activation function is:
• $f(net) = \begin{cases} 1 & \text{if } net > \theta \\ 0 & \text{if } -\theta \le net \le \theta \\ -1 & \text{if } net < -\theta \end{cases}$   (7)
• where $\theta$ is the threshold. If an error occurred for a particular training input pattern, that is $y_j \ne t_j$, then the weights and the biases are changed according to the formulas
• $w_{ij}(\text{new}) = w_{ij}(\text{old}) + \alpha\, t_j\, x_i$   (8)
• $b_j(\text{new}) = b_j(\text{old}) + \alpha\, t_j$   (9)
• where $\alpha$ is the learning rate and $t_j$ is the target value, +1 or -1. If an error did not occur, the weights and biases are not changed. Training continues until no error occurs.
• Fig. 3.1. The architecture of the MOPL
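• A compact sketch of MOPL training under rules (6)-(9), with one output neuron per category and bipolar targets; the array shapes and names are illustrative rather than the exact implementation used in the experiments.

```python
import numpy as np

def train_mopl(X, T, alpha=0.1, theta=0.2, max_epochs=100):
    """Multi-output perceptron learning: X holds the input patterns (rows),
    T the +1/-1 targets, one column per category."""
    W = np.zeros((X.shape[1], T.shape[1]))
    b = np.zeros(T.shape[1])

    def activate(net):                                   # activation function (7)
        return np.where(net > theta, 1, np.where(net < -theta, -1, 0))

    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, T):
            y = activate(x @ W + b)                      # output activation (6)
            wrong = y != t
            if wrong.any():
                W[:, wrong] += alpha * np.outer(x, t[wrong])   # weight update (8)
                b[wrong] += alpha * t[wrong]                   # bias update (9)
                errors += 1
        if errors == 0:                                  # stop when every pattern is correct
            break
    return W, b
```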
      • 3.4.2 Basic theory of the BPNN algorithm.
• BPNN is a generalization of the delta rule for training multi-layer feed-forward neural networks with non-linear units. It is simply a gradient descent method to minimize the total error (or mean error) of the output computed by the network. Fig. 3.2 shows such a network.
      • Fig. 3.2. Typical three layers BP network
      • In the network, there is an input layer, an output layer, and one or more hidden layers in between.
      • During training, the network is given an input pattern to the input layer. Based on the given input pattern, the network will compute the output in the output layer. This network output is then compared with the desired output pattern. The aim of the back-propagation learning rule is to define a method to adjust the weights of the networks. Eventually the network will give output that matches the desired output pattern given any input pattern in the training set.
• The training of a network by back-propagation involves three stages: the feed-forward of the input training pattern, the calculation and back-propagation of the associated error, and the adjustment of the weights and the biases. Each stage is explained in detail as follows:
• Input pattern feed-forward. Calculate each neuron's input and output. For the neuron $j$, the input and output are
• $net_j = \sum_i w_{ij}\, o_i + b_j$   (10)
• $o_j = f(net_j)$   (11)
• where $w_{ij}$ is the weight of the connection from the neuron $i$ in the previous layer to the neuron $j$, $f$ is the activation function of the neurons, $o_i$ and $o_j$ are the outputs of the previous neuron and the neuron $j$, and $b_j$ is the bias input to the neuron.
• Error calculation. Calculate the total absolute error in the output layer as
• $E_p = \sum_j \lvert t_j - o_j \rvert$   (12)
• and the mean absolute error as
• $E = \frac{1}{P} \sum_{p=1}^{P} E_p$   (13)
• where $P$ is the number of training patterns. The absolute error is used to evaluate the learning effect; training continues until the absolute error falls below some threshold or tolerance level. Calculate the back propagation error both in the output layer and in the hidden layer as
• $\delta_j = (t_j - o_j)\, \frac{\lambda}{2}\, (1 + o_j)(1 - o_j)$   (14)
• $\delta_h = \Big(\sum_j \delta_j\, w_{hj}\Big)\, \frac{\lambda}{2}\, (1 + o_h)(1 - o_h)$   (15)
• where $t_j$ is the desired output of the output neuron, $o_j$ is the actual output in the output layer, $o_h$ is the actual output value in the hidden layer, and $\lambda$ is the adjustable variable in the activation function. The back propagation error is used to update the weights and biases both in the output layer and in the hidden layer.
• Weights and biases adjustment. The weights and biases are adjusted as
• $w_{ij}(k+1) = w_{ij}(k) + \eta\, \delta_j\, o_i$   (16)
• $b_j(k+1) = b_j(k) + \eta\, \delta_j$   (17)
• where $k$ is the epoch number and $\eta$ is the learning rate.
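• The three stages above can be put together in a short training loop. The sketch below uses the bipolar sigmoid with the adjustable slope lambda discussed in Sections 3.4.3-3.4.4; it is a plain illustration of equations (10)-(17), not the tuned implementation evaluated in Chapter 5.

```python
import numpy as np

def bp_epoch(X, T, W1, b1, W2, b2, eta=0.1, lam=1.0):
    """One epoch of back-propagation for a three-layer network:
    feed-forward (10)-(11), error back-propagation (14)-(15), update (16)-(17)."""
    f = lambda x: (1 - np.exp(-lam * x)) / (1 + np.exp(-lam * x))   # bipolar sigmoid
    df = lambda o: 0.5 * lam * (1 + o) * (1 - o)                    # its derivative, via the output
    total_error = 0.0
    for x, t in zip(X, T):
        h = f(x @ W1 + b1)                                 # hidden-layer outputs
        o = f(h @ W2 + b2)                                 # output-layer outputs
        total_error += np.abs(t - o).sum()                 # absolute error (12)
        delta_o = (t - o) * df(o)                          # output-layer error (14)
        delta_h = (W2 @ delta_o) * df(h)                   # hidden-layer error (15)
        W2 += eta * np.outer(h, delta_o); b2 += eta * delta_o    # updates (16)-(17)
        W1 += eta * np.outer(x, delta_h); b1 += eta * delta_h
    return total_error / len(X)                            # mean absolute error (13)
```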
      • 3.4.3 BPNN defect analysis and commonly used improved methods
      • The three main defects of BPNN and some common improved methods are as follows:
• Slow training speed. At the beginning of learning, training goes very fast and each epoch makes a big improvement, but it slows down later [27]. There are two commonly used methods of improving the speed of training for BPNNs. a) Introduce momentum into the network. Convergence is sometimes faster if a momentum term is added to the weight update formulas. The weight update formula for a BPNN with momentum is:
• $\Delta w_{ij}(t+1) = \eta\, \delta_j\, o_i + \mu\, \Delta w_{ij}(t)$   (18)
• where the momentum parameter $\mu$ is constrained to be in the range from 0 to 1; the new weights for training step t+1 are based on the weights at training steps t and t-1. b) Use an adaptive learning rate to adjust the weights. The role of the adaptive learning rate is to allow each weight to have its own learning rate, and to let the learning rates vary with time as training progresses. A formula for a BPNN with an adaptive learning rate is:
• $\eta(k+1) = \begin{cases} a\, \eta(k) & \text{if } E(k) < E(k-1) \\ b\, \eta(k) & \text{otherwise} \end{cases}, \qquad a > 1 > b > 0$   (19)
• where $k$ is the epoch during the training process and $E(k)$ is the absolute error in that epoch. When E decreases, the learning effect increases (the weights may change to a greater extent); otherwise the learning effect decreases.
      • These two methods accelerate the convergence of the BPNN, but they cannot solve the other problems associated with the BPNN, especially when the size of the network is large.
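• The two accelerations can be expressed in a few lines; the momentum update mirrors (18), while the learning-rate rule only follows the behaviour described for (19) (grow while the error falls, shrink otherwise) with illustrative factors.

```python
def momentum_step(W, dW_prev, grad_term, eta=0.1, mu=0.9):
    """Weight change of (18): current gradient term plus a fraction of the previous change."""
    dW = eta * grad_term + mu * dW_prev
    return W + dW, dW

def adapt_learning_rate(eta, err, prev_err, up=1.05, down=0.7):
    """Increase eta while the absolute error keeps decreasing, otherwise decrease it."""
    return eta * up if err < prev_err else eta * down
```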
• Local minimum. When training a BPNN, it is easy to fall into a local minimum, and usually GA and simulated annealing algorithms have been used to solve this problem. These algorithms can reduce the chance of getting trapped in a local minimum, but they cannot guarantee that the network will reach the global minimum, and they are even slower than the traditional BPNN.
• Network paralysis. During training, the values of the weights may become very large and, consequently, the input of a neuron will be very large. Thus the output value of the activation function tends to 1 (or -1) and, according to the error back propagation formula, the back propagation error tends to 0. This phenomenon is referred to as saturation. The speed of training becomes very slow when saturation occurs. Eventually the weights stop changing, and this leads to network paralysis. P. D. Wasserman [28] suggested a formula to limit the weights to the range (-a, a), but it is only used for weight initialization. It cannot prevent the value of the weights from increasing during training, so the network may still be led into paralysis.
      • 3.4.4 MRBP algorithms
• The defects mentioned above are all related to saturation. In the case of saturation, convergence becomes slower and the system switches to a higher learning rate. The weights then become larger due to the larger learning rate, which causes the output value of the activation function to tend to 1. Under this situation, the network can easily enter a local minimum and ultimately become entrapped by network paralysis. Based on our experience with such problems, we also found another phenomenon which can cause such defects. For some of the neurons, the range of input values is restricted to a small interval during each epoch, which causes the output values to be extremely close to each other at each epoch, while the error changes only slowly from epoch to epoch. In other words, the speed of convergence is slow. Finally, this situation causes a local minimum or even network paralysis. In this thesis, we refer to these two kinds of phenomena as neuron overcharge and neuron tiredness, respectively, and we call such neurons morbidity neurons. In general, if some morbidity neurons occur within a network, the network cannot function effectively.
      • The MRBP improved method: During the learning process, neurons face two kinds of morbidity: overcharge and tiredness. If we avoid the appearance of morbidity neurons during the learning phase or rectify the problem in time, then the networks can train and evolve effectively.
• [Definition 1]: Neuron overcharged. If the input value of a neuron is very large or very small, its output value tends to -1 or 1 and its back propagation error tends to 0; we call such a neuron overcharged. For the activation function
• $f(x) = \frac{1 - e^{-\lambda x}}{1 + e^{-\lambda x}}$   (20)
• if, because the magnitude of the input $net_j$ is very large,
• $\lvert o_j \rvert = \lvert f(net_j) \rvert \rightarrow 1$   (21)
• resulting in $\delta_j \rightarrow 0$, then we call the neuron $j$ overcharged.
• [Definition 2]: Neuron tiredness. If a certain neuron always receives very similar stimulation, its responses to these stimulations will be very similar, so that it is difficult to distinguish the different stimulations by its responses. We then call the neuron tired. That is, when neuron $j$ during one learning phase (defined below) obeys
• $\max_p net_j(p) - \min_p net_j(p) < \varepsilon$   (22)
• for some small threshold $\varepsilon$, the neuron is tired.
• [Definition 3]: Learning phase. We choose N iterations (or learnings) as a period; during this period we record some important data and evaluate the effect of the learning process, which then directs the next period. We call this period the learning phase, and based on our experience, we use 50 epochs as one learning phase.
• According to the definitions of the overcharged neuron and the tired neuron, we know that they are directly related to the activation function. The conventional activation function
• $f(x) = \frac{1 - e^{-\lambda x}}{1 + e^{-\lambda x}}$   (23)
• uses $\lambda = 1$ or another constant, whereas in our model $\lambda$ is an adjustable variable. V. P. Plagianakos [29] also tried to use an adjustable value of $\lambda$. In fact, different combinations of $\lambda$ correspond to different learning models. The determination rule for a morbidity neuron is:
• If, at the end of a learning phase,
• $\lvert o_j \rvert$ exceeds the saturation bound for the extreme inputs of the phase (in our experiments, $\lvert o_j \rvert > 0.9$),   (24)
• then neuron $j$ is overcharged. And if
• $\max_p net_j(p) - \min_p net_j(p) < \varepsilon$,   (25)
• then neuron $j$ is tired.
• The formulas used to rectify the morbidity neurons are:
• $b_j \leftarrow b_j - \frac{\max_p net_j(p) + \min_p net_j(p)}{2}$   (26)
• $\lambda_j \leftarrow \frac{\ln 19}{\max_p \lvert net_j(p) \rvert}$   (27)
• Formula (26) is used to normalize the maximum and minimum input values of the previous phase in order to make them symmetric with respect to the origin. Formula (27) is used to limit the maximum and minimum output values to the normal range; in our experiment, the range is (-0.9, 0.9) (for the activation function (23), $f(x) = 0.9$ exactly when $\lambda x = \ln 19$). In our study, the morbidity neurons were rectified in each phase after their evaluation.
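• The phase-wise bookkeeping behind MRBP can be sketched as follows: over one learning phase we record each neuron's net input and output, flag overcharged and tired neurons in the spirit of (24)-(25), and re-centre and re-scale them in the spirit of (26)-(27). The thresholds and the exact rectification formulas here are this sketch's assumptions, not necessarily the thesis's precise choices.

```python
import numpy as np

def find_morbidity(net_history, out_history, eps=0.1, sat=0.9):
    """net_history/out_history: arrays of shape (patterns_in_phase, n_neurons)."""
    overcharged = np.abs(out_history).max(axis=0) > sat                    # cf. rule (24)
    tired = (net_history.max(axis=0) - net_history.min(axis=0)) < eps     # cf. rule (25)
    return overcharged, tired

def rectify(bias, lam, net_history, out_limit=0.9):
    """Shift the bias so the phase's input range is symmetric about the origin (cf. (26))
    and rescale lambda so the extreme output stays within (-0.9, 0.9) (cf. (27))."""
    hi, lo = net_history.max(axis=0), net_history.min(axis=0)
    bias = bias - (hi + lo) / 2.0
    span = np.maximum((hi - lo) / 2.0, 1e-6)
    lam = np.log((1 + out_limit) / (1 - out_limit)) / span   # bipolar sigmoid reaches 0.9 at the extremes
    return bias, lam
```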
      • Chapter 4
      • 4. Text Representation and Feature Reduction
      • 4.1 Text Representation
      • 4.1.1 Word extraction
      • A text document can be viewed as a long stream of characters. For the purpose of feature identification, it is necessary to convert a text document from a long stream of characters into a stream of words or tokens, where a word or token is defined as a group of characters with collective significance. In information retrieval, this text tokenization process is often called word extraction, word breaking, word segmentation, or lexical analysis.
• Depending on the natural language the text document is written in, the word extraction process can involve very different techniques. This process is relatively easy in languages such as English, where boundaries between words are marked by special word-delimiting characters such as spaces and punctuation. On the other hand, word extraction is particularly difficult for languages such as Chinese, Japanese and Korean, where the orthography does not use spaces between words.
      • In order to identify the word boundaries between individual words in these languages, more sophisticated techniques must be applied. In this thesis, we concern ourselves only with English documents. As such, the discussions below will be based on techniques applicable to English documents.
      • In designing the word extractor, we must define what constitutes a word in the operational point of view. That is, which kind of character sequences should be considered as an English word. This decision varies between different word extractor designs, especially on the handling of special characters such as digits and hyphenations.
      • After word extraction, a document is transformed into a stream of words with the order of word occurrence in the document preserved. However, in our design, as in many other systems, the ordering information of word occurrence is ignored. In this case, the stream of words from word extraction can be treated as an unordered set of words, with the word ordering or positional information discarded.
      • 4.1.2 Stop words removal
      • It is well recognized among the information retrieval community that a set of functional English words (such as ‘the’, ‘a’, ’and’, ‘that’, etc) is useless as indexing terms. Salton and McGill [30] described these words as having very low discrimination value, since they occur in almost every English document, and therefore do not help in distinguishing between documents with contents that are about different topics. For this reason, these words are not useful in the text categorization task and thus should be removed from the set of words produced by word extraction.
• The process of removing the set of non-content-bearing functional words from the set of words produced by word extraction is called stop words removal, and the functional words being removed are called stop words. In order to remove the stop words, a semi-automatic procedure is followed. This involves first creating a list of stop words to be removed, which is sometimes called the stop-list or a negative dictionary. After this, the set of words produced by word extraction is scanned so that every word appearing in the stop-list is removed.
• The major difficulty in stop words removal is in deciding which words should be put into the stop-list. This is difficult because the set of stop words is sometimes dependent on the topics of the given documents. Words that are not normally considered stop words in general English may be considered as such in specialized document collections. For example, words such as ‘home’, ‘page’, ‘world’, ‘web’, which are not considered stop words in general English, may be considered as such in a collection of World Wide Web related documents. In the case where the topics of the documents are unknown during the construction of the system, a stop-list for general English documents is usually employed.
      • 4.1.3 Word stemming
• In a text document, a word may exist in a number of morphological variants. For example, the word ‘compute’ may also exist in other morphological variants such as ‘computing’, ‘computed’, ‘computational’, or ‘computer’. While these morphological variants are different word forms, they represent the same concept. For information retrieval tasks, including text categorization, it is generally desirable to combine these morphological variants of the same word into one canonical form. In information retrieval, the process of combining or fusing together different morphological variants of the same word into a single canonical form is called stemming, or conflation. The canonical form produced by stemming is called a stem. The module that performs stemming is sometimes called a stemmer.
      • There are various approaches to stemming. The simplest is to build a translation table with two columns corresponding to the original word forms and their stems; stemming is then carried out by looking up the translation table. In order to improve efficiency, indexing methods such as B-trees or hashing can be employed to build an index over the table entries. One obvious drawback of this approach, however, is that the translation table has to be built manually, which requires a lot of effort and can be time-consuming. Another problem is that it is very difficult to build a translation table extensive enough to include every possible word that may appear in the documents; some words, especially specialized words in particular domains, are likely to be missed.
      • In our experiment, we use the Porter Stemmer algorithm [31]. The Porter Stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980. It is based on the idea that the suffixes in the English language (approximately 1200) are mostly combinations of smaller and simpler suffixes, and it operates as a sequence of linear steps. The Porter Stemmer is widely available and used in many applications; it is probably the stemmer most widely used in IR research.
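As an illustration, a widely available implementation of the Porter Stemmer (here the one shipped with NLTK, which we assume is installed) conflates the morphological variants mentioned above:

```python
from nltk.stem.porter import PorterStemmer  # assumes the NLTK package is installed

stemmer = PorterStemmer()
for word in ["compute", "computing", "computed", "computer", "computational"]:
    print(word, "->", stemmer.stem(word))
# All of these variants reduce to (approximately) the same stem, e.g. 'comput'.
```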
      • 4.1.4 Term weight
      • After word extraction, stop words removal and stemming, each text document is transformed into a set of stems corresponding to the set of words appearing in that document. The next step is to find the union of all these sets, such that the union contains the stems corresponding to all of the words appearing in the given set of documents. Duplicates are removed so that each stem is unique within the union set. In information retrieval, this set of stems constitutes the set of indexing terms for the set of documents; we call it the indexing vocabulary.
      • There are many different approaches for determining the term weights. The simplest involves a binary term weight: the weight $w_{ij}$ is 0 if indexing term $i$ is not in document $j$ and 1 otherwise. In this case, the feature vector representing a document is a bit vector, with each bit corresponding to the term weight of one indexing term for that document. This binary representation of documents has been used quite commonly in learning-based information retrieval tasks.
      • Another approach makes use of term occurrence frequency information. The assumption here is that a term mentioned more frequently in a document carries more weight in representing the topic of that document, and thus is more important and should be assigned a higher term weight. In this case, the term weight $w_{ij}$ is a positive integer equal to the term frequency (TF) of indexing term $i$ in document $j$, i.e. the number of times the indexing term appears in document $j$.
      • Other term weighting schemes, such as the inverse document frequency (IDF) and the product of the term frequency and the inverse document frequency (TF×IDF), are also commonly used in text categorization and information retrieval.
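The weighting schemes discussed above can be sketched as follows, assuming each document has already been reduced to a list of stems; the toy documents and variable names are illustrative.

```python
import math
from collections import Counter

docs = [["price", "gold", "rise"], ["gold", "mine", "gold"], ["trade", "deficit", "rise"]]
vocab = sorted({t for d in docs for t in d})
N = len(docs)
# document frequency of each term: number of documents containing it
df = {t: sum(1 for d in docs if t in d) for t in vocab}

def weights(doc):
    tf = Counter(doc)
    binary = [1 if t in tf else 0 for t in vocab]          # binary weights
    freq   = [tf[t] for t in vocab]                        # raw term frequency (TF)
    tfidf  = [tf[t] * math.log(N / df[t]) for t in vocab]  # TF x IDF
    return binary, freq, tfidf

print(weights(docs[1]))
```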
      • 4.2 Dimensional Reductions
      • The main difficulty in the application of neural networks to text categorization is the high dimensionality of the input feature space typical of textual data. Each unique term in the vocabulary represents one dimension of the feature space, so the size of the input of the neural network depends upon the number of stemmed words.
      • In order to improve the scalability of the text classifier, four dimensionality reduction techniques, namely the DF method, the CF-DF method, the TF×IDF method and Principal Component Analysis (PCA) [32], are applied to reduce the feature space. The aim of these techniques is to minimize information loss while maximizing the reduction in dimensionality.
      • 4.2.1 The DF method
      • Given a set of training documents together with a specification of which of the pre-defined categories each training document belongs to, the DF method reduces the vocabulary size by term selection based on a local term ranking technique. The categorization information is used for grouping the training documents such that all the documents belonging to the same category are put into the same group.
      • When there are overlaps between the categories, a document may belong to more than one group. After the documents are grouped, we can then form groups of the indexing terms in the vocabulary by putting in a group all terms contained in documents belonging to the same category. This process results in a set of sub-vocabularies corresponding to each category.
      • In the DF method, terms are ranked based on the document frequency (DF) of each term within a document group. For each document group, the document frequency of a term is defined as the number of documents within that particular group containing the term. By choosing the document frequency as the importance measure, we are assuming that the important terms are those that appear frequently within a group of documents belonging to the same category. This is because the set of terms which are good representatives of the category topics should be used by most documents belonging to that category.
      • Based on the DF importance measure, terms are ranked separately within each sub-vocabulary. For term selection, a selection parameter is defined such that within each sub-vocabulary, only the most important (highest-ranked) terms are selected. The sets of selected terms from each sub-vocabulary are then merged together to form the reduced feature set.
      • By adjusting the selection parameter, we can control the dimensionality of the reduced feature vectors: a smaller value results in fewer terms being selected, and thus a higher reduction in dimensionality.
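A minimal sketch of the DF selection step, assuming each training document is given as a set of stems together with the set of categories it belongs to; the parameter name top_k stands for the selection parameter and is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def df_select(docs, labels, top_k):
    """docs: list of term sets; labels: list of category sets.
    Keep the top_k terms by within-group document frequency in every category group."""
    group_df = defaultdict(Counter)
    for terms, cats in zip(docs, labels):
        for c in cats:                    # a document may belong to several groups
            group_df[c].update(terms)     # each term counted once per document (sets)
    selected = set()
    for counts in group_df.values():
        selected.update(t for t, _ in counts.most_common(top_k))
    return selected                       # merged reduced feature set
```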
      • 4.2.2 The CF-DF method
      • From the discussion of the DF method, we observe that terms that appear in most documents within the whole training set will always have a high within-group document frequency. Even though these frequently occurring terms are of very low discrimination value, and thus not helpful in distinguishing between documents belonging to different categories, they are likely to be selected by the DF method. The CF-DF method alleviates the problem by considering the discrimination value of a term in the term selection process.
      • In the CF-DF method, a quantity called category frequency (CF) is introduced. To determine the category frequency of a term, the training documents are grouped according to the categorization information, as in the DF method. For any document group, we say that a term appears in that group if at least one of the documents in that group contains that term. For any term in the vocabulary, the category frequency is equal to the number of groups that the term appears in.
      • By this definition, terms that are concentrated in a few categories will have a low category frequency, while those that are distributed across a large number of categories will have a high category frequency. The idea is that the discrimination value of a term can be measured as the inverse of its category frequency. In other words, we assume that terms that are good discriminators are most likely concentrated in a few categories, and should be considered more important as they are helpful in distinguishing between documents belonging to different categories.
      • In the CF-DF method, a two-phase process is used for term selection. In the first phase, we define a threshold on the category frequencies of the terms, such that a term is selected only if its category frequency is below the threshold. In the second phase, the DF method is applied to the surviving terms for further selection, producing the reduced feature set.
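The two-phase CF-DF selection can be sketched in the same setting; the argument names cf_threshold and top_k are assumptions of this sketch.

```python
from collections import Counter, defaultdict

def cf_df_select(docs, labels, cf_threshold, top_k):
    """docs: list of term sets; labels: list of category sets.
    Phase 1: keep terms whose category frequency is below cf_threshold.
    Phase 2: DF selection (top_k per category group) over the surviving terms."""
    categories = {c for cats in labels for c in cats}
    cf = Counter()
    for c in categories:
        group_terms = set().union(*(d for d, cats in zip(docs, labels) if c in cats))
        cf.update(group_terms)                       # +1 for every group the term appears in
    low_cf = {t for t, f in cf.items() if f < cf_threshold}

    group_df = defaultdict(Counter)
    for terms, cats in zip(docs, labels):
        for c in cats:
            group_df[c].update(terms & low_cf)       # within-group document frequency
    selected = set()
    for counts in group_df.values():
        selected.update(t for t, _ in counts.most_common(top_k))
    return selected
```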
      • 4.2.3 The TFxIDF method
      • In the DF method and the CF-DF method, the essential idea is to perform ranking of the terms in the vocabulary based on some importance measure, such that the most important terms can be selected. In both of these methods, the key to minimize information loss as a result of term selection is to define a good importance measure so as to avoid filtering out terms that are useful for the text categorization task. A good measurement of the importance of a term in a document set is the product of the term occurrence frequency (TF) and the inverse document frequency (IDF). The inverse document frequency of the term is commonly defined as:
      $$\mathrm{IDF}_i = \log\frac{N}{n_i} \qquad (28)$$
      • where $N$ is the number of documents in the document set, and $n_i$ is the number of documents in which term $i$ appears. By this definition, a term that appears in fewer documents will have a higher IDF. The assumption behind this definition is that terms concentrated in a few documents are more helpful in distinguishing between documents with different topics.
      • In order to examine the effectiveness of this measure for term selection, we propose to use the TF×IDF value to measure the importance of a term. In other words, the terms are ranked according to their TF×IDF values, and a selection parameter is set such that only the terms with the highest values are selected to form the reduced feature set.
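A sketch of TF×IDF-based term selection using formula (28) for the IDF part; taking TF as the total occurrence count of a term over the training set is one plausible reading and should be treated as an assumption of this sketch.

```python
import math
from collections import Counter

def tfidf_select(docs, top_k):
    """docs: list of term lists. Rank terms by TF x IDF over the whole
    training set and keep the top_k highest-scoring terms."""
    N = len(docs)
    tf = Counter(t for d in docs for t in d)              # collection term frequency
    df = Counter(t for d in docs for t in set(d))         # document frequency
    score = {t: tf[t] * math.log(N / df[t]) for t in tf}  # IDF as in formula (28)
    return [t for t, _ in sorted(score.items(), key=lambda x: -x[1])[:top_k]]
```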
      • 4.2.4 Principal component analysis
      • Principal component analysis (PCA) [32] is a statistical technique for dimensionality reduction which aims at minimizing the loss in variance in the original data. It can be viewed as a domain independent technique for feature extraction, which is applicable to a wide variety of data. This is in contrast with the other three dimensionality reduction techniques we have discussed, which are domain specific feature selection techniques based on feature importance measures defined specifically for textual data.
      • In order to perform principal component analysis on the set of training documents, we represent the set of feature vectors by an $n$-dimensional random vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$,
      • where $n$ is the vocabulary size, and the random variable $x_i$ takes on values from the term frequencies of term $i$ in the documents. We now find a set of $n$-dimensional orthogonal unit vectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$ to form an orthonormal basis for the $n$-dimensional feature space. We form the projections of $\mathbf{x}$ onto this set of unit vectors: $a_i = \mathbf{u}_i^T\mathbf{x}$, for $i = 1, \ldots, n$.
      • In doing so, we perform a coordinate transformation in the feature space, such that the unit vectors form the axes of the new coordinate system, and the original random vector $\mathbf{x}$ is transformed into a new random vector $\mathbf{a} = (a_1, a_2, \ldots, a_n)^T$ with respect to the new coordinate system.
      • In principal component analysis, the unit vectors are chosen such that the projections $a_i$ are uncorrelated with each other. Moreover, if we denote the variance of $a_i$ by $\lambda_i$, for $i = 1, \ldots, n$, then the following condition is satisfied: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$.
      • In other words, the projections contain decreasing variance; these projections are called the principal components. It can be shown [32] that the variances $\lambda_i$ correspond to the eigenvalues of the data covariance matrix $\mathbf{R}$ arranged in descending order, and the unit vectors $\mathbf{u}_i$ are the corresponding eigenvectors of $\mathbf{R}$. In order to reduce the dimensionality of the feature space from $n$ to $m$, where $m < n$, while minimizing the loss in data variance, we form a reduced feature space by taking the first $m$ dimensions with the largest variance. In this case, the reduced feature vectors of the documents are represented by the $m$-dimensional random vector $\mathbf{a}' = (a_1, a_2, \ldots, a_m)^T$.
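A minimal NumPy sketch of this reduction: centre the term-frequency matrix, compute its covariance matrix, take the eigenvectors with the largest eigenvalues, and project onto them. The function and variable names are illustrative.

```python
import numpy as np

def pca_reduce(X, m):
    """X: (documents x vocabulary) matrix of term frequencies.
    Returns the documents projected onto the m principal components."""
    Xc = X - X.mean(axis=0)                  # centre the data
    R = np.cov(Xc, rowvar=False)             # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(R)     # eigh: suited to symmetric matrices
    order = np.argsort(eigvals)[::-1][:m]    # largest eigenvalues first
    U = eigvecs[:, order]                    # unit eigenvectors u_1 .. u_m
    return Xc @ U                            # reduced feature vectors
```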
      • Chapter 5
      • 5. Performance Evaluation and Experimental Results
      • 5.1 The Reuters Collection
      • For the evaluation and test of different categorizing approaches we will therefore use a widely acknowledged [33, 34] and distributed corpus originally used for information retrieval purposes, namely, the Reuters-21578 collection [35]. In recent years the collection has been modified to fit text categorization purposes. Reuters-21578 consists of 21578 economic news stories that originally appeared on the Reuters newswire in 1987. Each story has been manually assigned one or more indexing labels from a fixed list, making Reuters-21578 an excellent example of a real-world classification task. The labels are broken into categories such as TOPIC, PEOPLE, and PLACES, but the methodology of most researchers is to consider only the 135 TOPIC labels for classification. Like most text classification problems, these TOPIC labels form a set of binary classification tasks in which the choice is whether to label each example as a positive or negative example of each class. All of these tasks involve positive classes represented by a small minority of the total training data.
      • The Reuters corpus was originally developed in the context of a knowledge-engineering approach to classification. Topic categories were defined as part of the process of knowledge extraction, resulting in CONSTRUE, a rule-based classification system which produced very high levels of accuracy on unseen examples [36]. After the completion of the CONSTRUE work, the corpus was obtained by David Lewis, who prepared it for use in machine learning systems as part of the work for his Ph.D. thesis. Lewis eventually oversaw the cleaning up and formatting of Reuters that resulted in its present form [37]. Reuters-21578 is freely available from Lewis' web site and is widely used in text classification experiments, making it one of a small number of corpora which can be used as a point of comparison with other studies.
      • Table 5.1 contains some statistics relating to Reuters-21578. The statistics in the Basic counts category show simple tallies of each item. The statistics in the Classifications category show the total number of positive and negative classifications summed for all the available classes. The statistics in the Distribution of class labels category indicate the number of classes (topics) that contain the given numbers of documents.
      • The corpus contains 21578 examples, just over half of which have been assigned at least one topic label. There are 30765 unique words and the average length of a document is 129 words. Many of the documents have no topic labels attached and are assumed to represent negative examples of every class. Fully one half of the topic labels are represented by fewer than 10 documents, and the 10 most frequent classes account for 75% of the total number of positive classifications (instances of assigned class labels) in the corpus.
      • Table 5.1: Some statistics for the Reuters-21578 corpus.
      • In order to use a text corpus for machine learning research, it must be split into sets of training and testing examples. There are a number of standard ways to do this described in the notes distributed with the Reuters-21578 data. The “Modified Apte split” (ModApte) first used in [38] was chosen for this research because of the abundance of recent work using this configuration. The ModApte approach splits the corpus at April 7, 1987, placing all documents written on or before this date into the training set, and the rest into the test set. Some 7856 documents are excluded for various reasons, leading to a corpus of 9603 training examples and 3299 testing examples, some of which have no class labels attached. Only classes for which at least one training and test example exists are included, leading to a total of 90 classes as shown in Table 5.2. Each class name is followed by the number of examples found in the training and testing sets.
      • Table 5.2: A list of the 90 Reuters classes (topics) in the ModApte split.
      • All examples in Reuters-21578 are marked up with SGML tags that bracket the various parts of the document. An example of a document from Reuters-21578 is shown in Fig 5.1. For our purposes, the important fields are: <TOPICS> , which delimits the list of topic categories, <D> which delimits individual topics within this list, and <TEXT> which delimits the text of the document. The TEXT field is split into sub-fields such as <TITLE>, <DATELINE> and <BODY> but these tags are ignored and all text delimited by the <TEXT> tags is included.
      • It has been noted in previous classification studies that the topic labels in Reuters-21578 often consist of words that themselves serve as good clues for classification. For example, the presence of the word corn in a newswire story is very good evidence that the topic label corn should be assigned. This fact was exploited to improve classification accuracy in [39]. Nevertheless, the Reuters-21578 corpus does pose some challenging problems for machine learning. Many categories do not correspond directly to words found in the text (ipi, lei, cpi, bop etc.), most classes are represented by a very small relative number of positive examples, and some news stories consist of titles only, or simply contain the words “blah blah blah” in the body. Despite these difficulties, however, researchers have obtained good results on the ModApte split of this corpus.
      • Fig 5.1: An example of a document from the Reuters-21578 corpus.
      • This document should be categorized as both vegoil and palmoil. The class labels for the document appear delimited by the <TOPICS> and <D> tags at the top. For the classification experiments in this paper, the text of the document was assumed to be delimited by the <TEXT> tags. SGML tags within the <TEXT> field were ignored. In this example, only the relevant SGML fields are shown.
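For illustration only, the fields we rely on can be pulled out of a Reuters-21578 document with a simple regular-expression pass; a real SGML parser is more robust, so this is just a sketch and the sample document below is invented.

```python
import re

def parse_reuters_doc(sgml):
    """Extract the topic labels (between <TOPICS>/<D> tags) and the raw text
    (between <TEXT> tags, with inner tags stripped) from one document."""
    topics_block = re.search(r"<TOPICS>(.*?)</TOPICS>", sgml, re.S)
    topics = re.findall(r"<D>(.*?)</D>", topics_block.group(1)) if topics_block else []
    text_block = re.search(r"<TEXT.*?>(.*?)</TEXT>", sgml, re.S)
    text = re.sub(r"<[^>]+>", " ", text_block.group(1)) if text_block else ""
    return topics, text

doc = ("<REUTERS><TOPICS><D>veg-oil</D><D>palm-oil</D></TOPICS>"
       "<TEXT><TITLE>...</TITLE><BODY>Prices rose.</BODY></TEXT></REUTERS>")
print(parse_reuters_doc(doc))
```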
      • 5.2 Evaluation Measures
      • The performance of text categorization systems can be evaluated based on their categorization effectiveness. Categorization effectiveness indicates the ability of a text categorization system to provide accurate classification of text documents with respect to a set of pre-defined categories.
      • To evaluate the performance of text categorization systems, we need to define measures of categorization effectiveness. One obvious approach is to redefine precision and recall in the context of text categorization. We can arrive at the definitions by looking at an analogy between the text retrieval task and the text categorization task. In text retrieval, the retrieval system has to decide whether to retrieve a document or not, based on whether the document is relevant to a given query; this decision has to be made for each document and each query. Similarly, a text categorization system has to decide whether to assign a category to a document or not, based on whether the document is relevant to the topics represented by the category; this decision is made for each document and each pre-defined category. Based on this analogy, precision, recall and the F measure (which combines precision and recall) can be defined in the context of text categorization, in terms of the numbers of true positive (TP), false positive (FP) and false negative (FN) assignment decisions, as

      $$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2PR}{P + R}.$$
      • When there is more than one pre-defined category, there are two different ways of computing the precision and recall of the categorization system, referred to as macroaveraging and microaveraging in [40]. In macroaveraging, separate precision and recall values are calculated for each category and then averaged over all categories to get the overall precision and recall. Microaveraging, on the other hand, calculates the precision and recall by considering the assignment decisions for all categories at once.
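A sketch of the two averaging schemes, assuming the per-category counts of true positives, false positives and false negatives have already been tallied; the data structure used here is an assumption of this sketch.

```python
def macro_micro(counts):
    """counts: dict mapping category -> (tp, fp, fn)."""
    # Macroaveraging: compute precision/recall per category, then average.
    precs = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, fn in counts.values()]
    recs  = [tp / (tp + fn) if tp + fn else 0.0 for tp, fp, fn in counts.values()]
    macro_p, macro_r = sum(precs) / len(precs), sum(recs) / len(recs)
    # Microaveraging: pool all assignment decisions, then compute once.
    TP = sum(tp for tp, fp, fn in counts.values())
    FP = sum(fp for tp, fp, fn in counts.values())
    FN = sum(fn for tp, fp, fn in counts.values())
    micro_p, micro_r = TP / (TP + FP), TP / (TP + FN)
    return (macro_p, macro_r), (micro_p, micro_r)
```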
      • 5.3 Experimental Results
      • 5.3.1 Experimental design
      • In all of the experiments, we used a subset of the documents from the Reuters-21578 test collection for training and testing our text categorization model. We chose 1600 documents from the Reuters data set belonging to ten frequent categories such as “earn” and “acq”; 600 documents are used for training and 1000 documents are used for testing.
      • To create the set of indexing terms in the vocabulary, the 600 training documents were processed by word extraction, stop words removal, and stemming. After word stemming, we merged the sets of stems from each of the 600 training documents, which resulted in a set of 6122 indexing terms in the vocabulary. In order to create the set of initial feature vectors for representing the 600 training documents, we measured the term weight of each of the 6122 indexing terms. The feature vectors were then formed from these term weights, and each feature vector was of the form
      $$\mathbf{d}_j = (w_{1j}, w_{2j}, \ldots, w_{nj}) \qquad (29)$$
      • where $w_{ij}$ is the term weight of indexing term $i$ in document $j$. Feature vectors are created in this way for both the training and the testing documents; each of the feature vectors corresponding to the 600 training documents has dimensionality 6122.
      • In our experiment, we employ a logarithmic weight, which can be defined as
      $$w_{ij} = \log(1 + tf_{ij}) \qquad (30)$$
      • where $tf_{ij}$ is the frequency of indexing term $i$ in document $j$.
      • A dimensionality of 6122 is too high for the neural networks, so we reduce this size by choosing the terms with the highest term weights. We choose 1000 terms as the neural network input, since this offers a reasonable reduction that is neither too specific nor too general. The number of output nodes is equal to the number of pre-defined categories. Therefore, for the MOPL algorithm, we obtain a network with 1000 input nodes and 10 output nodes. For the BPNN algorithm, we must also decide the number of hidden nodes, for which the following rules of thumb for hidden node selection can be consulted:
      $$h = \sqrt{n + m} + a \qquad (31)$$
      $$h = \log_2 n \qquad (32)$$
      $$h = \frac{k}{a\,(n + m)} \qquad (33)$$
      • where $h$ is the number of hidden nodes, $n$ is the number of input nodes, $m$ is the number of output nodes, $a$ is a constant in the range (1, 10), and $k$ is the number of training documents. We select 15 hidden nodes. In fact, these rules serve only as a reference for determining the relationship between the numbers of nodes in the different layers, and different rules yield very different values. Our network therefore has three layers of 1000, 15 and 10 nodes.
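As a rough illustration of the resulting network size, a forward pass through a 1000-15-10 sigmoid network can be written as below; the random weight initialisation and the use of a sigmoid at both layers are assumptions of this sketch, not the exact training configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 1000, 15, 10                  # layer sizes used in our experiments

W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))     # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))    # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """x: feature vector of 1000 term weights; returns 10 category scores."""
    h = sigmoid(x @ W1)
    return sigmoid(h @ W2)

scores = forward(rng.random(n_in))
print(scores.shape)   # (10,)
```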
      • 5.3.2 Experimental results
      • In our experiment, in order to evaluate our improved method, we compared the mean absolute error of three methods: the first is the conventional BP network, which we call the traditional BPNN; the second uses the commonly adopted improvements, which we call the Modified BPNN; the third is our proposed method, which we call the MRBP network.
      • Fig. 5.2: Reduction of the mean absolute error during training with the three methods
      • From Fig. 5.2 we can see that, for the traditional BPNN, the error decreases very quickly at the beginning of training, slows down after a certain number of epochs, and then decreases smoothly. The Modified BPNN is almost 2-3 times faster than the first method at the beginning of training, but it rebounds once the error has fallen to a certain value and then oscillates during training. In our method, no morbidity neurons occur in the first learning phase, so the next learning phase is almost the same as for the Modified BPNN; from the third learning phase onward, however, it makes substantially faster progress and also shows a good tendency later in training. The performance results are given in Table 5.6.
      • The sizes of the networks and some parameters used in our experiments are given in Table 5.3, and the computation times of the networks are given in Table 5.4.
      Table 5.3. Network size and parameters

      | Neural Network | #Input Nodes | #Hidden Nodes | #Output Nodes | Learning Rate | Momentum | Threshold |
      |---|---|---|---|---|---|---|
      | MOPL | 1000 | | 10 | 0.01 | | 2 |
      | BPNN | 1000 | 15 | 10 | 0.01 | 0.8 | |
      Table 5.4. Computation time of the networks

      | Neural Network | Time | # Iterations | Mean error |
      |---|---|---|---|
      | MOPL | 14.21 s | 23 | |
      | Modified BPNN | 776.55 s | 3000 | 0.000854 |
      | MRBP Network | 785.32 s | 3000 | 0.000092 |
      • For MOPL, the performance results are given in Table 5.5.
      Table 5.5. Performance of the MOPL algorithm

      | Category | Precision | Recall |
      |---|---|---|
      | money-supply | 0.848 | 0.909 |
      | coffee | 0.824 | 0.887 |
      | gold | 0.913 | 0.901 |
      | sugar | 0.889 | 0.853 |
      | trade | 0.693 | 0.767 |
      | crude | 0.946 | 0.903 |
      | grain | 0.935 | 0.929 |
      | money-fx | 0.900 | 0.834 |
      | acq | 0.927 | 0.845 |
      | earn | 0.932 | 0.918 |
      | micro-average | 0.881 | 0.875 |

      F-measure: 0.878
      Table 5.6. Performance comparison of the three networks

      | Category | Traditional BPNN Precision | Traditional BPNN Recall | Modified BPNN Precision | Modified BPNN Recall | MRBP Network Precision | MRBP Network Recall |
      |---|---|---|---|---|---|---|
      | money-supply | 0.832 | 0.913 | 0.913 | 0.916 | 0.938 | 0.946 |
      | coffee | 0.847 | 0.901 | 0.882 | 0.900 | 0.929 | 0.933 |
      | gold | 0.943 | 0.914 | 0.944 | 1.000 | 0.955 | 1.000 |
      | sugar | 0.952 | 0.884 | 0.954 | 0.883 | 0.927 | 0.914 |
      | trade | 0.725 | 0.786 | 0.766 | 0.836 | 0.824 | 0.895 |
      | crude | 0.944 | 0.896 | 0.945 | 0.916 | 1.000 | 0.932 |
      | grain | 0.937 | 0.928 | 0.928 | 0.934 | 0.948 | 0.924 |
      | money-fx | 0.890 | 0.855 | 0.908 | 0.877 | 0.918 | 0.912 |
      | acq | 0.937 | 0.873 | 0.943 | 0.893 | 0.943 | 0.903 |
      | earn | 0.932 | 0.923 | 0.957 | 0.934 | 0.967 | 0.952 |
      | micro-average | 0.892 | 0.887 | 0.914 | 0.908 | 0.935 | 0.931 |

      F-measure: Traditional BPNN 0.889, Modified BPNN 0.911, MRBP Network 0.933
      • Chapter 6
      • 6. Conclusions and Future Work
      • This thesis proposes an algorithm for text categorization using an improved back-propagation neural network. MRBP detects and rectifies morbidity neurons in each learning phase. This method overcomes the network paralysis problem and has a good ability to escape from local minima. The training speed is also increased, and our experimental results show that the MRBP network can achieve higher categorization effectiveness than both MOPL and the Modified BPNN. The superiority of MRBP is especially obvious when the network is large.
      • Even though MRBP does not solve the problem of how to fix the structure of the BPNN, it provides a rule for adjusting the neurons and training the network effectively. In future work we plan to analyze and generalize the results of previous training phases as a direction for subsequent training.
      • References
      • [1] F. Sebastiani. Machine learning in automated text categorization. Technical report, Consiglio Nazionale delle Ricerche, Italy, 1999.
      • [2] T. Mitchell. Machine Learning. McGraw Hill, NY, US, 1996.
      • [3]. Belew, R. K. Adaptive Information Retrieval. Proceedings of the Twelfth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval. NY, NY, 11-20 1989.
      • [4] Rose, D. E. & Belew, R. K. A Connectionist and Symbolic Hybrid for Improving Legal Search. International Journal of Man-Machine Studies, 35,1-33 1991.
      • [5] Wong, S.K.M., Cai, Y.J., and Yao, Y.Y. Computation of Term Association by neural Network. SIGIR '93 Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993.
      • [6] Lin, X., Soergel, D. and Marchionini, G. A Self-Organizing Semantic Map for Information Retrieval. Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 262-269.1991.
      • [7] MacLeod, K. & Robertson, W. A Neural Algorithm for Document Clustering. Information Processing and Management, 27(3), 337-346. 1991.
      • [8]Chen, H. and Lynch, K. Automatic Construction of Networks of Concepts Characterizing Document Databases. IEEE Transactions on Systems, Man, and Cybernetics, 22(5), 885-902. 1992.
      • [9] Chen, H. and Ng T. An Algorithmic Approach to Concept Exploration in a Large Knowledge Network (Automatic Thesaurus Consultation): Symbolic Branch-and-Bound Search vs. Connectionist Hopfield Net Activation. Journal of the American Society for Information Science. 46(5), 348-369.1993
      • [10] Lin, C. & Chen, H. An automatic Indexing and Neural Network Approach to Concept Retrieval and Classification of Multilingual (Chinese-English) Documents. IEEE Transactions on Systems, Man and Cybernetics-Part B: Cybernetics, 26(1), 75-88.1996.
      • [11] Belkin, N. J. and Croft, W. B. Information filtering and information retrieval: two sides of the same coin? Communications of the ACM, 35(12), 29-38. 1992.
      • [12] Lewis, D. D. The TREC-4 filtering track: description and analysis. In Proceedings of TREC-4, 4th Text Retrieval Conference (Gaithersburg, US), pp. 165-180. 1995.
      • [13] Ho Lam. Using a generalized instance set for automatic text categorization. Technical report, The Chinese University of Hong Kong.
      • [14] Sebastiani, F. Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol.34, No.1, March 2002, pages 1-47.2002.
      • [15] Vapnik The Nature of Statistical Learning Theory. Springer, Berlin.1995.
      • [16] Kwok, J. T.-K. Automated Text Categorization Using Support Vector Machine. Proceedings of the International Conference on Neural Information Processing (ICONIP). 1998.
      • [17] Joachims, T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the 10th European Conference on Machine Learning, Chemnitz, Germany, pages 137-142. 1998.
      • [18] Joachims, T. A Statistical Learning Model of Text Classification for Support Vector Machines. In Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, pages 128-136. 2001.
      • [19] Mitchell, T. M. Machine learning. McGraw Hill, New York, US.1996.
      • [20] F.Rosenblatt, Principles of Neurodynamics, New York: Spartan Books, 1959.
      • [21] Schütze, H., Hull, D. A., and Pedersen, J. O. A comparison of classifiers and document representations for the routing problem. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, US), pp. 229-237. 1995.
      • [22] Wiener, E., Pedersen, J. O., and Weigend, A. S. A neural network approach to topic spotting. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US), pp. 317-332. 1995.
      • [23] W.S.McCulloch and W.Pitts, A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, no.5, pp.115-133, 1943.
      • [24] B.Widrow and M.E.Hoff, Adaptive switching circuits In 1960 IREWESCON Convention Record,(New York) pp.96-104, 1960.
      • [25] D.E. Rumelhart, G.E.Hinton, and R.J.Williams, Learning internal representations by error propagation, in Parallel distributed processing: explorations in the microstructure of cognition. Ch.8, Cambridge,MA:MIT Press, 1986.
      • [26] J.J.Hopfield, Neural networks and physical system with emergent collective computational abilities, Proceedings of the National Academy of Sciences, no.79, pp.2554-2558, 1982.
      • [27] Wei Wu, Guorui Feng, Zhengxue Li, and Yuesheng Xu. Deterministic convergence of an online gradient method for BP neural networks. IEEE Transactions on Neural Networks, Vol. 16, No. 3, 2005.
      • [28] P. D. Wasserman. Neural Computing: Theory and Practice. New York: Van Nostrand Reinhold, 1989.
      • [29] V. P. Plagianakos and M. N. Vrahatis. Training Neural Networks with Threshold Activation Functions and Constrained Integer Weights. In IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00), Volume 5, p. 5161, 2000.
      • [30] G.Salton, Automatic Information Organization and Retrieval. New York: McGraw-Hill,1968.
      • [31] M. F. Porter. An algorithm for suffix stripping. Program, Vol.14 no. 3 130-137.1980.
      • [32]I.T.Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986
      • [33] Han, Karypis and Kumar. Text categorization using weight adjusted k-nearest neighbour classification. Technical report, Dept. of CS, University of Minnesota, 1999.
      • [34] Thorsten Joachims. A probabilistic analysis of the rocchio algorithm with tfidf for text categorization. Logic J. of the IGPL, 1998.
      • [35] David D. Lewis. Distribution 1.0 readme file (v1.2) for reuters-21578.AT&T Labs - Research, 1997.
      • [36] Philip J. Hayes and Steven P. Weinstein. CONSTRUE: A System for Content-Based Indexing of a Database of News Stories. In Proc. Second Annual Conference on Innovative Applications of Artificial Intelligence. 1-5. 1990.
      • [37] David D. Lewis. Representation and Learning in Information Retrieval. Ph.D. thesis, University of Massachusetts at Amherst. Technical Report 91-93. February, 1992.
      • [38] Chidanand Apte, Fred Damerau and Sholom M. Weiss. Toward Language Independent Automated Learning of Text Categorization. In Proc. SIGIR-94. 23-30. 1994.
      • [39] Manuel de Buenaga Rodríguez, José María Gómez-Hidalgo and Belén Díaz-Agudo. Using WordNet to Complement Training Information in Text Categorization. In Proc. RANLP-97, 150-157. March 25-27, 1997.
      • [40] D.D.Lewis, Evaluating text categorization, in Proceedings of the Speech and Natural Language Workshop,1991.
      • Acknowledgement
      • I would like to express my greatest thanks to my academic supervisor, Dr. Soon Cheol Park, who drew my attention to text categorization and neural networks. His deep involvement, insightful advice and encouragement have been invaluable during the past two years and throughout the preparation of this thesis.
      • I am grateful to Dr. DongSun Park and Dr. JaeDong Yang for serving as my thesis committee members. I also thank my lab-mates; through our discussions and seminars, I have gained much, both in my research and in my personal development.
      • Finally, I would like to thank my parents and my girlfriend Liu Lijun for their patience, understanding and support.