 |
 |
|
|
|
|
Title: |
US6343288:
Single pass space efficient system and method for generating an approximate quantile in a data set having an unknown size
[ Derwent Title ]

|
Country: |
US United States of America

|
| |
Inventor: |
Lindsay, Bruce Gilbert; San Jose, CA
Manku, Gurmeet Singh; Santa Clara, CA
Rajagopalan, Sridhar; San Jose, CA

|
Assignee: |
International Business Machines Corporation, Armonk, NY
other patents from INTERNATIONAL BUSINESS MACHINES CORPORATION (280070) (approx. 44,393)
News, Profiles, Stocks and More about this company

|
Published / Filed: |
2002-01-29
/ 1999-03-12

|
Application Number: |
US1999000268089

|
IPC Code: |
Advanced:
G06F 17/30;
Core:
more...
IPC-7:
G06F 17/30;

|
ECLA Code: |
G06F17/30S4P8D;

|
U.S. Class: |
Current:
707/007;
707/002;
707/006;
707/101;
Original:
707/007;
707/002;
707/006;
707/101;

|
Field of Search: |
707/002,6,7,101

|
Priority Number: |
| 1999-03-12 |
US1999000268089 |

|
Abstract: |
A space-efficient system and method for generating an approximate phi-quantile data element of a data set in a single pass over the data set, without a priori knowledge of the size of the data set. The approximate phi-quantile is guaranteed to lie within a user-specified approximation error epsi of the true quantile being sought with a probability of at least 1-delta, with delta being a user-defined probability of failure. B buffers, each having a capacity of k elements, initially are filled with elements from the data set, with the values of b and k depending on approximation error e and the probability delta. The buffers are then collapsed into an output buffer, with the remaining buffers then being refilled with elements, collapsed (along with the previous output buffer), and so on until the entire data set has been processed and a single output remains. The element of the output corresponding to the approximate quantile is then output as the approximate quantile. In later iterations (when the height of the tree is at least equal to a predetermined height that depends on delta and epsi), the data is sampled non-uniformly to populate the buffers to render the desired performance. Parallel processors can be used, with the final output buffers of the processors being sent to a collecting processor P0 as input buffers to the collecting processor P0.

|
Attorney, Agent or Firm: |
Rogitz, John L. ;

|
Primary / Asst. Examiners: |
Mizrahi, Diane D.;

|
INPADOC Legal Status: |
Show legal status actions

|
Parent Case: |
RELATED APPLICATIONS
This application is related to U.S. patent application Ser. No. 09/050,434 filed Mar. 30, 1998 now U.S. Pat. No. 6,108,658 for an invention entitled "SINGLE PASS SPACE EFFICIENT SYSTEM AND METHOD FOR GENERATING AN APPROXIMATE QUANTILE IN A DATA STREAM THAT SATISFIES AN APRIORI USER-DEFINED APPROXIMATION ERROR", owned by the present assignee and incorporated herein by reference.

|
Family: |
None

|
First Claim:
Show all 48 claims |
We claim:
1. A computer system, comprising:
- a general purpose computer;
- one or more input devices associated with the computer for generating a user specification, the specification establishing at least one desired approximate φ-quantile, a quantile approximation error ε, and a probability of failure δ, wherein the approximate φ-quantile is guaranteed to represent a true quantile of a data set and lie within the quantile approximation error ε with a probability of at least 1-δ;
- a data set having a size, the size being unavailable to the computer in advance; and
- computer usable code means executable by the computer for determining a φ-quantile data element in the data set, the computer usable code means having:
- means for determining a number b of buffers and a size k of each buffer, and a number h, based at least in part on the permissible approximation error ε and the probability of failure δ;
- means for filling empty buffers with data elements to establish a plurality of input buffers;
- means for collapsing data elements in input buffers into at least one output buffer; and
- means for outputting, from an output buffer, at least one φ-quantile data element.

|
Background / Summary: |
Show background / summary

|
Drawing Descriptions: |
Show drawing descriptions

|
Description: |
Show description

|
Forward References: |
Show 3 U.S. patent(s) that reference this one

|
 |
 |
|
|
|
|
Foreign References: |
None

|
Other References: |
Fu, et al. (IEEE publication, 2001) paper entitled "Novel algorithms for computing medians and other quantiles of dis-resident data" in Database Engineering and Application, 2001 International Symposium, pp. 145-154.*
Publication: "A One-Pass Space-Efficient Algorithm for Finding Quantiles". Agrawal et al. Proceedings of the 7th International Conference on Management of Data (COMAD-95). India. 1995.
Publication: "A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data." Alsabti et al. pp. 346-355. Proceedings of the 23rd VLDB Conference. Athens, Greece, 1997.
Publication: "Efficient Sampling Strategies for Relational Database Operations." Lipton et al. pp. 195-226. Theoretical Computer Science. vol. 116. 1993.
(32 pages)
Cited by 8 patents
[ISI abstract]
Publication: "A Minimum Storage Algorithm for Computing the Median." Pohl. IBM Research Center, Combinatorial Mathematics. Nov. 1969.
Book: "An Introduction to Probability Theory and its Applications". Feller. Princeton Univ. vol. 1, Third Edition. pp. 212-242. 1968.
Publication: "The P2 Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations." Jain et al. Communications of the ACM. vol. 28, No. 10. pp. 1076-1085. Oct. 1985.
(10 pages)
Cited by 4 patents
Publication: "Mining Association Rules Between Sets of Items in Large Databases" Agrawal et al. pp. 207-216. ACM 1993.
Publication: "Equidepth Partitioning of a Data Set Based on Finding its Medians." Gurajada et al. pp. 92-101. IEEE. 1991.
Publication: "Quantile Estimation From Grouped Data: The Cell Midpoint." B. W. Schmeiser et al. pp. 221-234. Communications in Statistics: Simulation and Computation. pp. 221-234. 1977.
(14 pages)
Cited by 3 patents
Publication: "The Generation of Order Statistics in Digital Computer Simulation: A Survey." B.W. Schmeiser: Proceedings of The Winter Simulation Conference. pp. 137-140. 1978.
Publication: "Equi-Depth Histograms For Estimating Selectivity Factors For Multi-Dimensional Queries." Muralikrishna et al. pp. 28-36. ACM SIGMOD. 1988.
Publication: "Accurate Estimation of the Number of Tuples Satisfying a Condition." Piatetsky-Shapiro et al. pp. 256-275. SIGMOD. Proceedings of Annual Meeting. Boston, MA. 1984.
Book: "Selection and Sorting with Limited Storage". Munro et al. Theoretical Computer Science, vol. 12. pp. 315-323. 1980.
(9 pages)
Cited by 3 patents
Publication: "A Theory of the Learnable." L.G. Valiant. Communication of the ACM. vol. 27, No. 11. pp. 1134-1142. Nov. 1984.
(9 pages)
Cited by 3 patents
Publication: "Improved Histograms for Selectivity Estimation of Range Predicates." Poosala et al. SIGMOD. pp. 294-305. Montreal, Canada. 1996.
Publication: "Selecting the Median." Dor et al. Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete algorithms. Chpt. 4, pp. 28-37. San Francisco, CA. 1995.
Publication: "Parallel Sorting on a Shared-Nothing Architecture Using Probabilistic Splitting." DeWitt et al. Proceedings of the 1st International Conference on Parallel and Distributed Information Systems. pp. 280-291. Florida. Dec. 1991.
Publication: "Access Path Selection in a Relational Database Management System." Selinger et al. Proceedings of the ACM-SIGMOD International Conference on Management of Data. pp. 23-34. Boston, MA. 1979.
Publication: "Progress in Selection." Mike Paterson. Dept. Computer Science. Univ. Warwick. United Kingdom. 1977.

|


|
Nominate this for the Gallery...

|
|