The Delphion Integrated View
Title: US6092038: System and method for providing lossless compression of n-gram language models in a real-time decoder


Country: US (United States of America)

Pages: 16
 
Inventors: Kanevsky, Dimitri (Ossining, NY); Rao, Srinivasa Patibandla (Jericho, NY)

Assignee: International Business Machines Corporation, Armonk, NY

Published / Filed: 2000-07-18 / 1998-02-05

Application Number: US1998000019012

IPC Code (Advanced): H03M 7/30
IPC-7: G06F 17/27; G06F 17/28

ECLA Code: H03M7/30;

U.S. Class: Current: 704/009; 704/257;
Original: 704/009; 704/257;

Field of Search: 704/009,10,1,251,255,256,257,243,270,277 707/530,531,532,533,534,104

Priority Number:
1998-02-05  US1998000019012

Abstract: System and methods for compressing (losslessly) n-gram language models for use in real-time decoding, whereby the size of the model is significantly reduced without increasing the decoding time of the recognizer. Lossless compression is achieved using various techniques. In one aspect, n-gram records of an N-gram language model are split into (i) a set of common history records that include subsets of n-tuple words having a common history and (ii) sets of hypothesis records that are associated with the common history records. The common history records are separated into a first group of common history records each having only one hypothesis record associated therewith and a second group of common history records each having more than one hypothesis record associated therewith. The first group of common history records are stored together with their corresponding hypothesis record in an index portion of a memory block comprising the N-gram language model and the second group of common history records are stored in the index together with addresses pointing to a memory location having the corresponding hypothesis records. Other compression techniques include, for instance, mapping word records of the hypothesis records into word numbers and storing a difference value between subsequent word numbers; segmenting the addresses and storing indexes to the addresses in each segment to multiples of the addresses; storing word records and probability records as fractions of bytes such that each pair of word-probability records occupies a multiple of bytes and storing flags indicating the length; and storing the probability records as indexes to sorted count values that are used to compute the probability on the run.
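One of the techniques the abstract describes — mapping word records to word numbers and storing only the difference between subsequent word numbers — can be sketched as below. This is a minimal illustration, not the patent's implementation; the function names and plain-list representation are assumptions:

```python
def delta_encode(word_numbers):
    """Store differences between subsequent sorted word numbers.

    The deltas are small non-negative integers, so they fit in fewer
    bits than the raw word numbers, yet the original sequence is
    recoverable exactly (lossless).
    """
    deltas, prev = [], 0
    for w in sorted(word_numbers):
        deltas.append(w - prev)
        prev = w
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    numbers, acc = [], 0
    for d in deltas:
        acc += d
        numbers.append(acc)
    return numbers
```

For example, word numbers [3, 17, 42, 43] encode to the smaller values [3, 14, 25, 1]; in an actual model the deltas would then be bit-packed to realize the size reduction.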

Attorney, Agent or Firm: F.Chau & Associates, LLP ;

Primary / Asst. Examiners: Thomas, Joseph;

Maintenance Status: E2 (Expired)


Family: None

First Claim (of 32):
What is claimed is:     1. A method for losslessly compressing an n-gram language model for storage in a storage device, the n-gram language model comprising a plurality of n-gram records generated from a training vocabulary, each n-gram record comprising an n-gram in the form of a series of "n-tuple" words (w1, w2, . . . wn), a count and a probability associated therewith, each n-gram having a history represented by the initial n-1 words of the n-gram, said method comprising the steps of:
  • splitting said plurality of n-gram records into (i) a set of common history records comprising subsets of n-tuple words having a common history and (ii) sets of hypothesis records that are associated with the common history records, each set of hypothesis records including at least one hypothesis record comprising a word record-probability record pair;
  • partitioning said common history records into at least a first group and a second group, said first group comprising each common history record having a single hypothesis record associated therewith, said second group comprising each common history record having more than one hypothesis record associated therewith;
  • storing said hypothesis records associated with said second group of common history records in said storage device; and
  • storing, in an index portion of said storage device, (i) each common history record of said second group together with an address that points to a location in said storage device having corresponding hypothesis records and (ii) each common history record of said first group together with its corresponding single hypothesis record.



Forward References: 58 U.S. patents reference this one

       
U.S. References: 6 backward references (listed below); 58 forward references.

Patent     Pub. Date  Inventor               Assignee                                     Title
US4342085  1982-07    Glickman, et al.       International Business Machines Corporation  Stem processing for data reduction in a dictionary storage file
US5467425  1995-11    Lau, et al.            International Business Machines Corporation  Building scalable N-gram language models using maximum likelihood maximum entropy N-gram models
US5649060  1997-07    Ellozy, et al.         International Business Machines Corporation  Automatic indexing and aligning of audio and text using speech recognition
US5724593  1998-03    Hargrave, III, et al.  International Language Engineering Corp.     Machine assisted translation tools
US5794249  1998-08    Orsolini, et al.       Hewlett-Packard Company                      Audio/video retrieval system that uses keyword indexing of digital recordings to display a list of the recorded text files, keywords and time stamps associated with the system
US5835888  1998-11    Kanevsky, et al.       International Business Machines Corporation  Statistical language model for inflected languages
       
Foreign References: None

Other Abstract Info: DERABS G2000-617929

Copyright © 1997-2014 Thomson Reuters