Title: |
US6233575:
Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values

|
Country: |
US United States of America

|
Inventor: |
Agrawal, Rakesh; San Jose, CA
Chakrabarti, Soumen; San Jose, CA
Dom, Byron Edward; Los Gatos, CA
Raghavan, Prabhakar; Saratoga, CA

|
Assignee: |
International Business Machines Corporation, Armonk, NY

|
Published / Filed: |
2001-05-15 / 1998-06-23

|
Application Number: |
US1998000102861

|
IPC Code: |
Advanced:
G06F 17/30;
IPC-7:
G06F 17/30;

|
ECLA Code: |
G06F17/30T4M;

|
U.S. Class: |
Current:
707/006;
706/012;
707/002;
707/E17.091;
Original:
707/006;
707/002;
706/012;

|
Field of Search: |
707/001-10,100-104,200-206,500-503,511-516,531-536,907
706/012-21,25-28,45-55,60-61,934
382/156-157

|
Priority Number: |
| 1998-06-23 | US1998000102861 |
| 1997-06-24 | US1997000050611P |

|
Abstract: |
A system, process, and article of manufacture for organizing a large text database into a hierarchy of topics and for maintaining this organization as documents are added and deleted and as the topic hierarchy changes. Given sample documents belonging to various nodes in the topic hierarchy, the tokens (terms, phrases, dates, or other usable feature in the document) that are most useful at each internal decision node for the purpose of routing new documents to the children of that node are automatically detected. Using feature terms, statistical models are constructed for each topic node. The models are used in an estimation technique to assign topic paths to new unlabeled documents. The hierarchical technique, in which feature terms can be very different at different nodes, leads to an efficient context-sensitive classification technique. The hierarchical technique can handle millions of documents and tens of thousands of topics. A resulting taxonomy and path enhanced retrieval system (TAPER) is used to generate context-dependent document indexing terms. The topic paths are used, in addition to keywords, for better focused searching and browsing of the text database.
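The top-down routing described in the abstract can be sketched roughly as follows. The abstract does not specify the statistical model or the estimation technique, so this sketch assumes a naive Bayes-style log-likelihood score; `TopicNode`, `route`, the `floor` smoothing value, and all field names are hypothetical and are not taken from the patent. Each internal node carries its own feature set, and only those terms are consulted when choosing among that node's children, which is what makes the classification context-sensitive.

```python
import math
from collections import Counter


class TopicNode:
    """Hypothetical topic node; field names are illustrative assumptions."""

    def __init__(self, name, children=(), features=(), term_probs=None, prior=1.0):
        self.name = name
        self.children = list(children)
        # Terms this node uses to discriminate among its own children.
        self.features = set(features)
        # Assumed estimate of P(term | this node), consulted by the parent.
        self.term_probs = term_probs or {}
        # Assumed prior probability of this node among its siblings.
        self.prior = prior


def route(root, tokens, floor=1e-6):
    """Return the topic path (list of node names) chosen for a token list."""
    counts = Counter(tokens)
    path = [root.name]
    node = root
    while node.children:
        def score(child, parent=node):
            # Log prior plus log-likelihood over the parent's feature terms only.
            total = math.log(child.prior)
            for term in parent.features:
                n = counts.get(term, 0)
                if n:
                    total += n * math.log(child.term_probs.get(term, floor))
            return total

        node = max(node.children, key=score)
        path.append(node.name)
    return path
```

Terms outside a node's feature set are ignored at that node, so a word that separates top-level topics well need not influence decisions deeper in the hierarchy.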

|
Attorney, Agent or Firm: |
Gates & Cooper LLP ;

|
Primary / Asst. Examiners: |
Breene, John; Channavajjala, Srirama


|
Parent Case: |
PROVISIONAL APPLICATION
The present application claims the benefit of U.S. Provisional Application Ser. No. 60/050,611, entitled "USING TAXONOMY, DISCRIMINANTS, AND SIGNATURES FOR NAVIGATING IN TEXT DATABASES", filed Jun. 24, 1997, by Rakesh Agrawal, et al., which is incorporated herein by reference, in its entirety.

|
Family: |
3 known family members

|
First Claim (1 of 32): |
What is claimed is:
1. A process for classifying new documents containing features under nodes defining a multilevel taxonomy, based on features derived from a training set of documents that have been classified under respective nodes of the taxonomy, the process comprising:
- associating a respective set of features with each one of said plurality of nodes, each given set of features comprising a plurality of features that are in at least one training document classified under the associated node; and
- classifying each new document under at least one node, based on the set of features associated with said at least one node, further comprising:
- determining a discrimination value for each term in at least one training document which is classified under each one of a plurality of the nodes of the taxonomy, wherein the discrimination value comprises a Fisher value based on the equation: [Figure]
- where t represents a term, d represents a document, c represents a class, [Figure]
- determining a minimum discrimination value for each of said plurality of nodes;
- wherein the features in each given set of features have discrimination values equal to or above the minimum discrimination value determined for the node associated with the given set of features.
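The claimed equation is not reproduced in this record (it appears only as a figure placeholder), so the following is a minimal sketch under an assumption: it uses one common form of the Fisher discriminant for term selection, namely the sum of squared differences between per-class mean term frequencies divided by the sum of per-class variances of that term. This may differ from the equation in the claim. The function names, the input layout, and the handling of zero variance are all hypothetical.

```python
def fisher_scores(docs_by_class):
    """Score each term by an assumed Fisher-style ratio.

    docs_by_class maps a class label to a list of documents, where each
    document is a dict of term -> relative frequency (an assumed layout).
    """
    classes = list(docs_by_class)
    terms = {t for docs in docs_by_class.values() for d in docs for t in d}
    scores = {}
    for t in terms:
        # Mean frequency of term t within each class.
        mu = {
            c: sum(d.get(t, 0.0) for d in docs_by_class[c]) / len(docs_by_class[c])
            for c in classes
        }
        # Between-class spread: squared mean differences over class pairs.
        between = sum(
            (mu[a] - mu[b]) ** 2
            for i, a in enumerate(classes)
            for b in classes[i + 1:]
        )
        # Within-class spread: per-class variance of the term, summed.
        within = sum(
            sum((d.get(t, 0.0) - mu[c]) ** 2 for d in docs_by_class[c])
            / len(docs_by_class[c])
            for c in classes
        )
        if within > 0:
            scores[t] = between / within
        else:
            # Assumed convention when the term never varies inside a class.
            scores[t] = float("inf") if between > 0 else 0.0
    return scores


def select_features(scores, min_value):
    """Keep terms whose score is equal to or above the node's minimum."""
    return {t for t, s in scores.items() if s >= min_value}
```

In the spirit of the claim, `fisher_scores` would be run once per internal node over the training documents filed under that node's children, and `select_features` would be applied with that node's own minimum discrimination value, so different nodes can end up with very different feature sets.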


|
Forward References: |
169 U.S. patents reference this one

|
U.S. References: |
| Patent | Pub. Date | Inventor | Assignee | Title |
| US4975975 | 1990-12 | Filipski | GTX Corporation | Hierarchical parametric apparatus and method for recognizing drawn characters |
| US5168565 | 1992-12 | Morita | Ricoh Company, Ltd. | Document retrieval system |
| US5317507 | 1994-05 | Gallant | | Method for document retrieval and for word sense disambiguation using neural networks |
| US5325298 | 1994-06 | Gallant | HNC, Inc. | Methods for generating or revising context vectors for a plurality of word stems |
| US5418946 | 1995-05 | Mori | Fuji Xerox Co., Ltd. | Structured data classification device |
| US5428778 | 1995-06 | Brookes | Office Express Pty. ltd. | Selective dissemination of information |
| US5469354 | 1995-11 | Hatakeyama et al. | Hitachi, Ltd. | Document data processing method and apparatus for document retrieval |
| US5506984 | 1996-04 | Miller | Digital Equipment Corporation | Method and system for data retrieval in a distributed system using linked location references on a plurality of nodes |
| US5519857 | 1996-05 | Kato et al. | Hitachi, Ltd. | Hierarchical presearch type text search method and apparatus and magnetic disk unit used in the apparatus |
| US5535382 | 1996-07 | Ogawa | Ricoh Company, Ltd. | Document retrieval system involving ranking of documents in accordance with a degree to which the documents fulfill a retrieval condition corresponding to a user entry |
| US5557794 | 1996-09 | Matsunaga et al. | Fuji Xerox Co., Ltd. | Data management system for a personal data base |
| US5568640 | 1996-10 | Nishiyama et al. | Hitachi, Ltd. | Document retrieving method in a document managing system |
| US5576954 | 1996-11 | Driscoll | University of Central Florida | Process for determination of text relevancy |
| US5600827 | 1997-02 | Nakabayashi et al. | Seiko Epson Corporation | Data management, display, and retrival system for a hierarchical collection |
| US5625767 | 1997-04 | Bartell et al. | | Method and system for two-dimensional visualization of an information taxonomy and of text documents based on topical content of the documents |
| US5659724 | 1997-08 | Borgida et al. | NCR | Interactive data analysis apparatus employing a knowledge base |
| US5675710 | 1997-10 | Lewis | Lucent Technologies, Inc. | Method and apparatus for training a text classifier |
| US5826260 | 1998-10 | Byrd, Jr. et al. | International Business Machines Corporation | Information retrieval system and method for displaying and ordering information based on query element contribution |
| US5838816 | 1998-11 | Holmberg | Hughes Electronics | Pattern recognition system providing automated techniques for training classifiers for non stationary elements |
| US5918240 | 1999-06 | Kupiec et al. | Xerox Corporation | Automatic method of extracting summarization using feature probabilities |
|
Foreign References: |

|
Other References: |
Ho, T.K. et al., "Decision Combination in Multiple Classifier Systems", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, No. 1, pp. 66-75, Jan. 1994.*
Chakrabarti, S. et al., "Enhanced Hypertext Categorization Using Hyperlinks", Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 307-318, Jun. 1998.*
Yuwono, B. et al., "Search and Ranking Algorithms for Locating Resources on World Wide Web", Proceedings of the 12th International Conference, pp. 164-171, Mar. 1996.*
Hill, P. et al., "Multiple Views of Product Information", IBM Technical Disclosure Bulletin, vol. 39, No. 02, pp. 17-24 (Feb. 1996).
Rus, D. et al., "Using Non-Textual Cues for Electronic Document Browsing", Digital Libraries Workshop DL '94, Newark, NJ, USA, May 19-20, 1994 Selected Papers, Chapter 9, pp. 129-162.
Koller, D. et al., "Hierarchically Classifying Documents Using Very Few Words", The Fourteenth International Conference on Machine Learning, pp. 170-178 (Jul. 1997).
Mladenic D., "Feature Subset Selection in Text-Learning", 10th European Conference on Machine Learning, pp. 95-100, (1998).
Yang, Y. et al., "A Comparative Study on Feature Selection in Text Categorization", International Conference on Machine Learning, pp. 412-420 (Jul. 1997).
Apte, C. et al., "Automated Learning of Decision Rules for Text Categorization", IBM Research Report RC 18879. To Appear in ACM Transactions on Information Systems, pp. 1-20 (no date).; vol. 12, Issue 3, accepted Mar. 1994.
Schutze, H. et al., "A Comparison of Classifiers and Document Representations for the Routing Problem", Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 229-237 (Jul. 1995).
Lewis, D., "Evaluating Text Categorization", Proceedings of the Speech and Natural Language Workshop, Asilomar, pp. 312-318 (Feb. 1991).
Lewis, D., "Feature Selection and Feature Extraction for Text Categorization", Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York pp. 212-217 (Feb. 1992).
Koller, D., "Toward Optimal Feature Selection", In Lorenza Saitta, ed., Machine Learning: Proc. Of the Thirteenth International Conference, Morgan Kaufmann, 9 pages, (1996).
Panyr, J., "STEINADLER--a system of automatic description and classification of documents", Nachr. Dok, vol. 29, No. 4-5, pp. 184-191 (Sep. 1978) (Abstract in English). Abstract in English Only Considered.

|
Continuity Data: |
| Application Number | Filed | Notes |
| US2001000777278 | 2001-02-05 | is a division of US1998000102861, filed 1998-06-23 (granted; US6233575 issued 2001-05-15) |
| US1998000102861 | 1998-06-23 | is a non-provisional of provisional US1997000050611P, filed 1997-06-24 |
