Title: |
US6253169:
Method for improvement accuracy of decision tree based text categorization

|
Country: |
US United States of America

|
Inventor: |
Apte, Chidanand; Chappaqua, NY
Damerau, Frederick J.; North Salem, NY
Weiss, Sholom M.; Highland Park, NJ

|
Assignee: |
International Business Machines Corporation, Armonk, NY

|
Published / Filed: |
2001-06-26 / 1998-05-28

|
Application Number: |
US1998000084985

|
IPC Code: |
Advanced:
G06F 17/30;
IPC-7:
G06E 17/27;
G06E 21/00;

|
ECLA Code: |
G06F17/30T4M;

|
U.S. Class: |
Current:
704/009;
707/E17.091;
715/236;
Original:
704/009;
707/531;

|
Field of Search: |
701/001,9,10
707/530,531,532,102

|
Priority Number: |
1998-05-28 / US1998000084985

|
Abstract: |
A text categorization method automatically classifies electronic documents by developing a single pooled dictionary of words for a sample set of documents, and then generating a decision tree model, based on the pooled dictionary, for classifying new documents. Adaptive resampling techniques are applied to improve the accuracy of the decision tree model.
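
The abstract names adaptive resampling but does not say which rule is used. The following is a minimal sketch of the general idea, assuming a simple weighting chosen for illustration: each training example is drawn with probability proportional to 1 + e^3, where e is how often earlier models misclassified it, and the resulting models vote. The function names, the weighting rule, and the `train` / `predict` callables are all hypothetical and are not taken from the patent.

```python
import random


def sampling_probabilities(error_counts):
    """Turn per-example error counts into sampling probabilities.

    Assumed rule for this sketch: weight = 1 + errors**3, then normalize.
    """
    weights = [1 + e ** 3 for e in error_counts]
    total = sum(weights)
    return [w / total for w in weights]


def adaptive_resample(examples, train, predict, rounds=3, seed=0):
    """Train several models, each on a sample biased toward past mistakes.

    examples: list of (features, label) pairs.
    train:    callable taking a list of examples and returning a model.
    predict:  callable taking (model, features) and returning a label.
    """
    rng = random.Random(seed)
    error_counts = [0] * len(examples)
    models = []
    for _ in range(rounds):
        probs = sampling_probabilities(error_counts)
        # Draw a training set of the same size, with replacement.
        drawn = rng.choices(examples, weights=probs, k=len(examples))
        model = train(drawn)
        models.append(model)
        # Record which original examples the new model gets wrong.
        for i, (x, y) in enumerate(examples):
            if predict(model, x) != y:
                error_counts[i] += 1
    return models, error_counts


def vote(models, predict, x):
    """Majority vote over binary (0/1) predictions."""
    votes = sum(predict(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0
```

In this sketch, examples that keep being misclassified are drawn more often in later rounds, so later models concentrate on the hard cases. Any tree learner could be passed in as `train`.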

|
Attorney, Agent or Firm: |
McGuireWoods, LLP;
Kaufman, Esq., Stephen C.;

|
Primary / Asst. Examiners: |
Isen, Forester W.; Edouard, Patrick N.

|
Family: |
None

|
First Claim (of 18): |
What is claimed is:
1. A text categorization method comprising:
- obtaining a collection of electronic documents;
- defining a sample set of documents from the collection;
- classifying the documents in the sample set in accordance with steps which include:
- (a) analyzing words in the documents of the sample set to identify a plurality of topics,
- (b) developing a plurality of local dictionaries, each containing words descriptive of a respective one of said plurality of topics, and
- (c) developing vectors for each of the documents in the sample set, with the vectors developed for each document in the sample set being indicative of words in a respective one of said plurality of local dictionaries developed for a respective one of said plurality of topics;
- forming a prediction model based on the classification of the documents in the sample set performed in said classifying step, said forming step including:
- (d) forming a plurality of decision trees for said plurality of topics, respectively, said decision trees each being formed based on the vectors developed for the documents in said sample for a respective one of said plurality of topics;
- classifying a new document based on the prediction model,
- wherein the step of classifying the documents in the sample set includes combining said plurality of local dictionaries into a single pooled dictionary, said single pooled dictionary containing sorted words with duplicate words removed, and
- wherein the step of classifying a new document based on the prediction model includes:
- identifying words in the new document which correspond to words in said single pooled dictionary;
- forming said words into groups belonging to respective ones of said plurality of topics;
- applying said plurality of decision trees to said groups to derive classification outcomes, each of said classification outcomes being generated by applying one of said plurality of decision trees to a respective one of said groups relative to one of said plurality of topics; and
- classifying the new document into at least one of said plurality of topics based on said classification outcomes.
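
The steps of claim 1 can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the patented implementation: topics come from labels already attached to the sample documents, a local dictionary is simply the most frequent words for a topic, and a one-level decision stump stands in for each per-topic decision tree. All names (`build_local_dictionary`, `pool_dictionaries`, `train_stump`, `classify`, `SAMPLE`, `STOPWORDS`) are hypothetical.

```python
from collections import Counter

STOPWORDS = {"and", "the", "of"}  # tiny illustrative list (assumption)


def tokenize(text):
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def build_local_dictionary(sample, topic, top_k=3):
    """Step (b): most frequent words in the documents labeled with `topic`."""
    counts = Counter()
    for text, doc_topics in sample:
        if topic in doc_topics:
            counts.update(tokenize(text))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def pool_dictionaries(local_dicts):
    """Single pooled dictionary: sorted words with duplicates removed."""
    return sorted({w for words in local_dicts.values() for w in words})


def vectorize(text, dictionary):
    """Step (c): binary vector marking which dictionary words occur."""
    present = set(tokenize(text))
    return [1 if w in present else 0 for w in dictionary]


def train_stump(vectors, labels):
    """Stand-in for step (d): index of the feature that best predicts the label."""
    best_errors, best_index = len(labels) + 1, 0
    for j in range(len(vectors[0])):
        errors = sum(1 for v, y in zip(vectors, labels) if v[j] != y)
        if errors < best_errors:
            best_errors, best_index = errors, j
    return best_index


def build_model(sample, topics, top_k=3):
    local_dicts = {t: build_local_dictionary(sample, t, top_k) for t in topics}
    pooled = pool_dictionaries(local_dicts)
    stumps = {}
    for t, words in local_dicts.items():
        vectors = [vectorize(text, words) for text, _ in sample]
        labels = [1 if t in doc_topics else 0 for _, doc_topics in sample]
        stumps[t] = train_stump(vectors, labels)
    return local_dicts, pooled, stumps


def classify(text, pooled, local_dicts, stumps):
    """Match words against the pooled dictionary, group them by topic,
    apply each topic's stump to its group, and collect positive topics."""
    found = set(tokenize(text)) & set(pooled)
    result = []
    for topic, words in local_dicts.items():
        group = [1 if w in found else 0 for w in words]
        if group[stumps[topic]] == 1:
            result.append(topic)
    return result


SAMPLE = [
    ("wheat harvest grain exports rise", {"grain"}),
    ("grain prices and wheat supply", {"grain"}),
    ("bank raises interest rates", {"finance"}),
    ("interest rates and bank lending", {"finance"}),
]

local_dicts, pooled, stumps = build_model(SAMPLE, ["grain", "finance"])
print(pooled)  # ['bank', 'exports', 'grain', 'interest', 'rates', 'wheat']
print(classify("Wheat and grain shipments delayed.", pooled, local_dicts, stumps))  # ['grain']
```

The pooled dictionary lets a new document be scanned once against a single word list. The matched words are then regrouped per topic so that each topic's model sees only its own local vocabulary, which is the division of labor the final clauses of the claim describe.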

|
Forward References: |
38 U.S. patent(s) reference this one
