Consumer Health Corpus - The Raw N-gram Set
I. Introduction
The Consumer Health Corpus (the corpus used in CSpell) is the source of the n-gram set used for LMW candidate generation.
II. N-gram Set Specifications
| Document count | Word Count | N-gram |
III. Process
${LMW_DIR}/bin/21.CSpellHealthCorpus
2017
| Option | Description | Inputs - ${IN_DIR}/${OUT_DIR} | Outputs - ${OUT_DIR} | Notes |
|---|---|---|---|---|
| Generate the raw n-gram set | | | | |
| 2 | Convert XML files to raw corpus text files | CSpellHealthCorpus/Crawl/*/*.html | 21.CSpellHealthCorpus/RawCorpus/*.data | |
| 4 | Generate all raw n-gram files (N = 1-5) | 21.CSpellHealthCorpus/RawCorpus/*.data | 21.CSpellHealthCorpus/nGrams/nGram.${N}.data | |
| 6 | Sort all raw n-gram files | 21.CSpellHealthCorpus/nGrams/nGram.${N}.data | | |
| 7 | Generate the raw n-gram set | 21.CSpellHealthCorpus/nGrams/nGram.${N}.data.dwt | 21.CSpellHealthCorpus/nGrams/nGramSet.${YEAR}.1 | min wC = 1 |
| 8 | Zip n-gram set | | | |
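
The core of options 4 through 7 (count n-grams for N = 1-5, sort them, and merge them into one set with a minimum word count) can be sketched in Python. This is only an illustration, not the LMW implementation: the function names `count_ngrams` and `build_ngram_set`, the whitespace tokenization, and the rule that n-grams do not cross line boundaries are all assumptions.

```python
from collections import Counter


def count_ngrams(lines, max_n=5):
    """Count word n-grams for N = 1..max_n, one Counter per N.

    Assumptions (not from the LMW tools): tokens are split on
    whitespace and n-grams never span two input lines.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            # Slide a window of n tokens across the line.
            for i in range(len(tokens) - n + 1):
                counts[n][" ".join(tokens[i:i + n])] += 1
    return counts


def build_ngram_set(counts, min_wc=1):
    """Merge per-N counts into one list of (ngram, count) pairs.

    Entries are ordered by N, then alphabetically within each N,
    and only n-grams with count >= min_wc are kept.
    """
    merged = []
    for n in sorted(counts):
        for ngram, wc in sorted(counts[n].items()):
            if wc >= min_wc:
                merged.append((ngram, wc))
    return merged


if __name__ == "__main__":
    sample = ["take two tablets daily", "take two tablets"]
    for ngram, wc in build_ngram_set(count_ngrams(sample), min_wc=1):
        print(f"{ngram}\t{wc}")
```

With `min_wc=1`, as in the Notes column for option 7, no n-gram is filtered out, so the set size is simply the total number of distinct n-grams across all five values of N.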
IV. Results
| N-grams | File | Zip Size | Actual Size | No. of n-grams |
|---|---|---|---|---|
| Unigrams | nGram.1.2017.tgz | 0.985 Mb | 2.8 Mb | 194,407 |
| Bigrams | nGram.2.2017.tgz | 6.6 Mb | 23 Mb | 1,233,365 |
| Trigrams | nGram.3.2017.tgz | 18 Mb | 65 Mb | 2,806,783 |
| Four-grams | nGram.4.2017.tgz | 29 Mb | 111 Mb | 3,906,380 |
| Five-grams | nGram.5.2017.tgz | 39 Mb | 149 Mb | 4,396,030 |
| N-gram Set | nGramSet.2017.1.tgz | 92 Mb | 350 Mb | 12,536,965 |
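
As a consistency check on the table above, the n-gram count in the N-gram Set row equals the sum of the five per-N counts. A short Python sketch of that arithmetic (the variable names are illustrative only):

```python
# Per-N n-gram counts copied from the results table.
ngram_counts = {
    1: 194_407,
    2: 1_233_365,
    3: 2_806_783,
    4: 3_906_380,
    5: 4_396_030,
}

total = sum(ngram_counts.values())
print(f"{total:,}")  # 12,536,965
```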