

 The Delphion Integrated View

Title: US6253169: Method for improvement accuracy of decision tree based text categorization


Country: US United States of America

Pages: 23

Inventor: Apte, Chidanand; Chappaqua, NY
Damerau, Frederick J.; North Salem, NY
Weiss, Sholom M.; Highland Park, NJ

Assignee: International Business Machines Corporation, Armonk, NY

Published / Filed: 2001-06-26 / 1998-05-28

Application Number: US1998000084985

IPC Code: Advanced: G06F 17/30;
IPC-7: G06E 17/27; G06E 21/00;

ECLA Code: G06F17/30T4M;

U.S. Class: Current: 704/009; 707/E17.091; 715/236;
Original: 704/009; 707/531;

Field of Search: 701/001,9,10 707/530,531,532,102

Priority Number:
1998-05-28  US1998000084985

Abstract:     A text categorization method automatically classifies electronic documents by developing a single pooled dictionary of words for a sample set of documents, and then generating a decision tree model, based on the pooled dictionary, for classifying new documents. Adaptive resampling techniques are applied to improve the accuracy of the decision tree model.
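
As a rough illustration of the adaptive resampling idea mentioned in the abstract, the sketch below draws a new training sample in which documents that earlier models misclassified are more likely to be picked. The weighting rule (1 plus the error count) and the names `resampling_weights` and `draw_sample` are assumptions made for illustration; the abstract does not state the formula the patent uses.

```python
import random


def resampling_weights(error_counts):
    """Turn per-document error counts into selection probabilities.

    Assumed rule: weight = 1 + number of times the document was
    misclassified so far, normalized so the weights sum to 1.
    """
    raw = [1 + errors for errors in error_counts]
    total = sum(raw)
    return [value / total for value in raw]


def draw_sample(cases, error_counts, rng):
    """Draw len(cases) documents with replacement, favoring past errors."""
    weights = resampling_weights(error_counts)
    return rng.choices(cases, weights=weights, k=len(cases))


# Hypothetical data: three training documents and how often each
# one has been misclassified by the trees built so far.
cases = ["doc_a", "doc_b", "doc_c"]
error_counts = [0, 1, 3]
weights = resampling_weights(error_counts)
sample = draw_sample(cases, error_counts, random.Random(0))
```

A new tree would then be trained on `sample`, the error counts updated, and the cycle repeated; how the resulting trees are combined is outside what the abstract describes.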

Attorney, Agent or Firm: McGuireWoods, LLP; Kaufman, Esq., Stephen C.

Primary / Asst. Examiners: Isen, Forester W.; Edouard, Patrick N.


Family: None

First Claim (of 18 claims):
What is claimed is:
1. A text categorization method comprising:
  • obtaining a collection of electronic documents;
  • defining a sample set of documents from the collection;
  • classifying the documents in the sample set in accordance with steps which include:
    • (a) analyzing words in the documents of the sample set to identify a plurality of topics,
    • (b) developing a plurality of local dictionaries, each containing words descriptive of a respective one of said plurality of topics, and
    • (c) developing vectors for each of the documents in the sample set, with the vectors developed for each document in the sample set being indicative of words in a respective one of said plurality of local dictionaries developed for a respective one of said plurality of topics;
  • forming a prediction model based on the classification of the documents in the sample set performed in said classifying step, said forming step including:
    • (d) forming a plurality of decision trees for said plurality of topics, respectively, said decision trees each being formed based on the vectors developed for the documents in said sample for a respective one of said plurality of topics;
  • classifying a new document based on the prediction model,
  • wherein the step of classifying the documents in the sample set includes combining said plurality of local dictionaries into a single pooled dictionary, said single pooled dictionary containing sorted words with duplicate words removed, and
  • wherein the step of classifying a new document based on the prediction model includes:
    • identifying words in the new document which correspond to words in said single pooled dictionary;
    • forming said words into groups belonging to respective ones of said plurality of topics;
    • applying said plurality of decision trees to said groups to derive classification outcomes, each of said classification outcomes being generated by applying one of said plurality of decision trees to a respective one of said groups relative to one of said plurality of topics; and
    • classifying the new document into at least one of said plurality of topics based on said classification outcomes.
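
The dictionary steps of the claim can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the frequency-based choice of dictionary words, the `top_k` cutoff, and the sample documents are all assumptions, and the decision-tree training and application steps are left out.

```python
from collections import Counter


def local_dictionary(docs, top_k=3):
    """Assumed heuristic: the top_k most frequent words in one topic's documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]


def pooled_dictionary(local_dicts):
    """Combine local dictionaries into one sorted list with duplicates removed."""
    return sorted({word for words in local_dicts.values() for word in words})


def group_by_topic(document, pooled, local_dicts):
    """Keep the document's words that occur in the pooled dictionary, per topic."""
    present = set(document.lower().split()) & set(pooled)
    return {topic: sorted(present & set(words))
            for topic, words in local_dicts.items()}


def to_vector(group, local_dict):
    """Binary vector over one local dictionary: 1 if the word is in the group."""
    return [1 if word in group else 0 for word in local_dict]


# Hypothetical sample set, already labeled with two topics.
samples = {
    "grain": ["wheat corn harvest", "wheat export corn"],
    "money": ["bank rate export", "bank rate rise"],
}
local_dicts = {topic: local_dictionary(docs) for topic, docs in samples.items()}
pooled = pooled_dictionary(local_dicts)

new_document = "Wheat export falls as bank rate holds"
groups = group_by_topic(new_document, pooled, local_dicts)
vectors = {topic: to_vector(groups[topic], local_dicts[topic])
           for topic in local_dicts}
```

Each entry of `vectors` is what a per-topic decision tree would receive as input; the word "export" appears in both local dictionaries but only once in `pooled`, which is the duplicate removal the claim describes.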



Forward References: 38 U.S. patents reference this one

       
U.S. References (15 backward references):

Patent | Pub. Date | Inventor | Assignee | Title | Pages
US5182708 | 1993-01 | Ejiri | Ricoh Corporation | Method and apparatus for classifying text | 15
US5317507 | 1994-05 | Gallant | (none listed) | Method for document retrieval and for word sense disambiguation using neural networks | 20
US5371807 | 1994-12 | Register et al. | Digital Equipment Corporation | Method and apparatus for text classification | 20
US5463773 | 1995-10 | Sakakibara et al. | Fujitsu Limited | Building of a document classification tree by recursive optimization of keyword selection function | 17
US5526443 | 1996-06 | Nakayama | Xerox Corporation | Method and apparatus for highlighting and categorizing documents using coded word tokens | 17
US5619709 | 1997-04 | Caid et al. | HNC, Inc. | System and method of context vector generation and retrieval | 44
US5659766 | 1997-08 | Saund et al. | Xerox Corporation | Method and apparatus for inferring the topical content of a document based upon its lexical content without supervision | 19
US5675710 | 1997-10 | Lewis | Lucent Technologies, Inc. | Method and apparatus for training a text classifier | 14
US5675819 | 1997-10 | Schuetze | Xerox Corporation | Document information retrieval using global word co-occurrence patterns | 29
US5687364 | 1997-11 | Saund et al. | Xerox Corporation | Method for learning to infer the topical content of documents based upon their lexical content | 19
US5732260 | 1998-03 | Nomiyama | International Business Machines Corporation | Information retrieval system and method | 12
US5794178 | 1998-08 | Caid et al. | HNC Software, Inc. | Visualization of information using graphical representations of context vector based relationships and attributes | 45
US5819259 | 1998-10 | Duke-Moran et al. | Hartford Fire Insurance Company | Searching media and text information and categorizing the same employing expert system apparatus and methods | 28
US5963205 | 1999-10 | Sotomayor | Iconovex Corporation | Automatic index creation for a word processor | 30
US5969442 | 1999-09 | Prasad | Motorola, Inc. | Reaction propulsion motor and apparatus for using the same | 10
       
Foreign References: None

Other References:
  • C. Apte et al.; "Automated Learning of Decision Rules for Text Categorization"; IBM Research Report RC 18879; ACM Transactions on Information Systems; pp. 1-20.
  • C. Apte et al.; "Towards Language Independent Automated Learning of Text Categorization Models"; IBM Research Report RC 19481; In Proceedings of ACM SIGIR '94.
  • S. Weiss et al.; "Predictive Data Mining--A Practical Guide"; Morgan Kaufmann Publishers, Inc.; 1998; pp. 135-199.
  • I. Dagan et al.; "Mistake-Driven Learning in Text Categorization"; Empirical Methods in NLP; Aug. 1997.
  • Y. Freund et al.; "Experiments with a New Boosting Algorithm"; AT&T Corp; 1996; pp. 148-156.
  • P. Hayes et al.; "Adding Value to Financial News by Computer"; AI on Wall Street; Oct. 1991; pp. 2-8.
  • P. Hayes et al.; "TCS: A Shell for Content-Based Text Categorization"; CH2842-3/90/0000/0320$01.00; IEEE; 1990; pp. 320-326.
  • T. Joachims; "Text Categorization with Support Vector Machines: Learning with Many Relevant Features"; Universitat Dortmund, Dortmund, Germany; pp. 1-15.
  • D. Lewis; "Feature Selection and Feature Extraction for Text Categorization"; University of Chicago; pp. 212-217.
  • Y. Yang; "An Evaluation of Statistical Approaches to Text Categorization"; Carnegie Mellon University; Apr. 1997; pp. 1-10.
  • Y. Yang (Carnegie Mellon University) and Jan Pedersen (Verity, Inc); "A Comparative Study on Feature Selection in Text Categorization".
  • C. Apte et al.; "Data Mining with Decision Trees and Decision Rules"; Future Generation Computer Systems; Nov. 1997; pp. 1-13.

