|
Title: |
US6560597:
Concept decomposition using clustering

|
Country: |
US United States of America

|
Inventor: |
Dhillon, Inderjit Singh; Austin, TX
Modha, Dharmendra Shantilal; San Jose, CA

|
Assignee: |
International Business Machines Corporation, Armonk, NY

|
Published / Filed: |
2003-05-06 / 2000-03-21

|
Application Number: |
US2000000528941

|
IPC Code: |
Advanced:
G06F 17/30;
IPC-7:
G06F 17/30;

|
ECLA Code: |
G06F17/30T4M;

|
U.S. Class: |
707/004;
707/006;
707/102;

|
Field of Search: |
707/004,5,3,102,6
709/201

|
Priority Number: |
2000-03-21 | US2000000528941

|
Abstract: |
A system and method operates with a document collection in which documents are represented as normalized document vectors. The document vector space is partitioned into a set of disjoint clusters and a concept vector is computed for each partition, the concept vector comprising the mean vector of all the documents in each partition. Documents are then reassigned to the cluster having their closest concept vector, and a new set of concept vectors for the new partitioning is computed. From an initial partitioning, the concept vectors are iteratively calculated to a stopping threshold value, leaving a concept vector subspace of the document vectors. The documents can then be projected onto the concept vector subspace to be represented as a linear combination of the concept vectors, thereby reducing the dimensionality of the document space. A search query can be received for the content of text documents and a search can then be performed on the projected document vectors to identify text documents that correspond to the search query.
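
The iterative step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes that "closest" means highest dot product between unit-length vectors, that each mean vector is rescaled to unit length before use, and that iteration stops when a summed-similarity score improves by less than a tolerance. The names `cluster`, `concept_vectors`, `quality`, and `tol` are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def concept_vectors(docs, labels, k):
    """Sum the documents in each partition and rescale to unit length.

    Rescaling the sum gives the same direction as rescaling the mean.
    Unit-length concept vectors are an assumption of this sketch.
    """
    d = len(docs[0])
    sums = [[0.0] * d for _ in range(k)]
    for doc, j in zip(docs, labels):
        for i, x in enumerate(doc):
            sums[j][i] += x
    return [normalize(s) for s in sums]

def quality(docs, labels, concepts):
    """Total similarity of every document to its own concept vector."""
    return sum(dot(doc, concepts[j]) for doc, j in zip(docs, labels))

def cluster(docs, labels, k, tol=1e-6, max_iter=100):
    """Alternate reassignment and concept-vector updates until the gain is below tol."""
    docs = [normalize(doc) for doc in docs]
    concepts = concept_vectors(docs, labels, k)
    prev = quality(docs, labels, concepts)
    for _ in range(max_iter):
        # Reassign each document to the concept vector it is most similar to.
        labels = [max(range(k), key=lambda j: dot(doc, concepts[j])) for doc in docs]
        concepts = concept_vectors(docs, labels, k)
        cur = quality(docs, labels, concepts)
        if cur - prev <= tol:  # stopping threshold (assumed form)
            break
        prev = cur
    return labels, concepts

# Toy collection: 4 documents over 3 words, with a deliberately poor initial partition.
toy_docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1]]
final_labels, final_concepts = cluster(toy_docs, [0, 1, 0, 1], k=2)
```

In this toy run the two documents dominated by the first word end up in one cluster and the two dominated by the second word in the other.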

|
Attorney, Agent or Firm: |
Hall, David A.; Heller Ehrman White & McAuliffe

|
Primary / Asst. Examiners: |
Mizrahi, Diane D.;

|
Family: |
None

|
First Claim (of 27 claims): |
We claim:
1. A method of operating a computer system to represent text documents stored in a database collection, comprising:
- representing the text documents in a vector representation format in which there are n documents and d words;
- normalizing the document vectors;
- determining an initial partitioning of the normalized document vectors comprising a set of k disjoint clusters and determining k cluster vectors, wherein a cluster vector comprises a mean vector of all the normalized document vectors in a partition;
- computing a set of k concept vectors based on the initial set of cluster vectors, wherein the concept vectors define a subspace of the document vector space and wherein the subspace spans a part of the document vector space; and
- projecting each document vector onto the subspace defined by the concept vectors, thereby defining a set of document concept decomposition vectors that represent the document vector space, with a reduced dimensionality.
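
The final projection step of claim 1 might look like the sketch below. It assumes the projection is the orthogonal (least-squares) projection onto the span of the concept vectors and that the concept vectors are linearly independent; the claim text shown here does not spell out either point. The helpers `solve` and `project` are hypothetical names introduced for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def project(doc, concepts):
    """Return (coeffs, approx): k coefficients and the d-dimensional approximation.

    Solves the normal equations G z = b, where G holds the pairwise dot products
    of the concept vectors and b holds their dot products with the document.
    """
    gram = [[dot(ci, cj) for cj in concepts] for ci in concepts]
    rhs = [dot(ci, doc) for ci in concepts]
    coeffs = solve(gram, rhs)
    approx = [sum(z * c[i] for z, c in zip(coeffs, concepts)) for i in range(len(doc))]
    return coeffs, approx

# A d = 3 document reduced to k = 2 coefficients.
example_concepts = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
coeffs, approx = project([0.6, 0.8, 0.5], example_concepts)
```

Each document is then stored as its k coefficients rather than its d word weights, which is where the reduction in dimensionality comes from when k is much smaller than d.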

|
Forward References: |
24 U.S. patents reference this one

|
Foreign References: |
None

|
Other Abstract Info: |
DERABS C2003-539858

|
Other References: |
Drineas et al., "Clustering in large graphs and matrices," SODA pp. 291-299 (1999).
Duda et al., "Pattern Classification and Scene Analysis," John Wiley & Sons pp. 211-228 and pp. 252-256 (1973).
Kleinberg et al., "A Microeconomic View of Data Mining," Department of Computer Science, Cornell University pp. 1-14 (1998).
Sahami et al., "Real-time Full-text Clustering of Networked Documents," Stanford University pp. 1-3 (1998).
