SPECIALIST Lexicon

N-gram Set by Prediction Filter

A new approach of prediction filter is developed to resolve issues of limited memory. This approache retrieve an approximate n-Gram as an alternative approach. However, this approach is not comprehensive and should be replace by a more thorough approach.

I. Prediction Filter:

Use the frequency (NWC) of normalized (n)-gram terms as filter to generate (n+1)-gram terms:

Only high frequency n-gram terms are used to generate multiwords for our applications (to add in to Lexicon). Low frequency n-gram terms are usually error prone multiwords (e.g. typos).
Among these n-gram terms, most of them has very low frequency (as the long tail of dinosaur shown on the single words frequency spectrum)
The frequency (WC) of normalized n-gram terms is used to reflect the real usage.
If a n-gram term has low NWC, the associated (n+1)-gram terms has lower (or equals) NWC. For example, unigram term "zytochemical" has NWC of 2. The associated bigram of "zytochemical" is "zytochemical findings" is also 2.
Let says the n-gram term is "n-gram term", the associated (n+1)-gram terms must be:
- "prefix n-gram term"
- "n-gram term suffix"
Use the frequency (NWC) of normalized (n)-gram terms as filter to generate (n+1)-gram terms to generate high frequency (n+1) gram terms that fit into to our computing resources (60Gb, which stores about 230 million of n-gram terms)a In other words, this approach retrieve the top NWC (frequency) 230 million of n-gram terms (for n = 1 ~ 5, respectively).

II. N-gram Set with Prediction Filter:

Get the nGram (n=1, uniGram) from section I.
Run the normalization on nGram and retrieve uniGram with normalized terms|word count
Manually review results and select a threshold (wc >= 50)
Retrieve a table with all original terms with normalized terms above the selected threshold
Generate biGram (n+1) as follows:
- Filter out the biGram ((n+1)-gram) term if either the starting or ending words in biGram terms are not in the unigram (n-gram) threshold table (below the threshold, because the frequency is too low). These low frequency term will be filtered out after they are generated (even we don't filter them out at this stage anyway).
Repeats the above processes for n = 2 ~ 5

III. Example Walk-through (MEDLINE.2014):

	Preprocess	unigrams	bigrams	trigrams	fourgrams	fivegrams
N		n=1	n=2	n=3	n=4	n=5
Step 1 PmidTiAbSentences{YY}n{DDDD}.txt	45 min. PmidTiAbS14: 1-746
Step 2 Gen uniGram, sorted		25 min. Documents: 22,361,238 Sentences: 126,681,740 Tokens:2,611,996,832 (100%) n-grams (unique tokens): 21,710,680
Step 3 Norm (n-1)-gram			(n-1)=1, 4 min.	(n-1)=2, 55 min.	(n-1)=3, 3.5 hr.	(n-1)=4, 3 hr.
Step 4 Gen (n-1)-gram for threshold on NWC			(n-1)=1 th=50, 1 min. Norm n-grams: 353,831 n-grams: 5,030,318	(n-1)=2 th=7500, 4 min. Norm n-grams: 32,400 n-grams: 748,220	(n-1)=3 th=1000, 5 min. Norm n-grams: 139,662 n-grams: 1,175,570	(n-1)=4 th=100, 5 min. Norm n-grams: 778,320 n-grams: 2,929,856
Step 5 Gen n-Gram		th=0 (no threshold used), 25 min. Documents:22,361,238 Sentences: 126,681,740 Tokens:2,611,996,832 Filtered unigrams: 2,611,996,832 (100%) Unique unigrams: 21,710,680	th=50, 5 hr. Documents:22,361,238 Sentences: 126,681,740 Tokens:2,485,317,128 Filtered bigrams: 2,482,775,482 (99.89%) Unique bigrams: 204,020,203	th=7,500, 22 hr. Documents:22,361,238 Sentences: 126,681,740 Tokens: 2,359,078,026 Filtered trigrams: 1,542,069,455 (65.37%) Unique trigrams: 225,891,864	th=1,000, 9 hr. Documents:22,361,238 Sentences: 126,681,740 Tokens: 2,233,430,789 Filtered fourgrams: 811,458,826 (36.33%) Unique fourgrams: 204,623,839	th=100, 6 hr. Documents:22,361,238 Sentences: 126,681,740 Tokens: 2,108,489,103 Filtered fivegrams: 464,595,426 (22.03%) Unique fivegrams: 186,219,982
Step 6 Sort n-Gram			33 min.	45 min.	40 min.	33 min.

The SPECIALIST Lexicon