SPECIALIST Lexicon

Multiword Candidate Generation Processes:
UMLS_CUI with top endWords in the Distilled Medline N-gram Set

N-grams from the latest Distilled Medline N-gram Set that are:

core-term, lowercase
Filter: exclude terms from Lexicon
Matcher: pass terms that directly match a UMLS string (UMLS-Str, field 15), because such strings have a CUI
Filter: exclude single words
Matcher: pass terms whose end word matches one of the top (33+) endWords from the Lexicon
- top endWords are: syndrome, protein, disease, proteins, cell, etc.
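
The filter and matcher chain above can be sketched as follows. This is a minimal illustration only, not the actual 09.MatcherCui or 10.MatcherEndWord code: the function names (end_word, top_end_words, generate_candidates) and the in-memory inputs are hypothetical, and both the case-insensitive comparison against UMLS strings and the frequency ranking of end words over Lexicon multiwords are assumptions.

```python
from collections import Counter


def end_word(term):
    """Return the last whitespace-delimited word of a term."""
    return term.split()[-1]


def top_end_words(lexicon_multiwords, n):
    """Return the n most frequent end words among multiword Lexicon terms.

    Assumption: "top" means highest frequency as an end word.
    """
    counts = Counter(
        end_word(term.lower())
        for term in lexicon_multiwords
        if len(term.split()) > 1  # only multiwords contribute end words
    )
    return {word for word, _ in counts.most_common(n)}


def generate_candidates(ngrams, lexicon_terms, umls_strings, end_words):
    """Apply the filters and matchers in the order listed above.

    ngrams are assumed to be core-term, lowercase n-grams already.
    """
    lexicon = {term.lower() for term in lexicon_terms}
    umls = {s.lower() for s in umls_strings}  # assumed case-insensitive match
    candidates = []
    for term in ngrams:
        if term in lexicon:          # Filter: exclude terms from Lexicon
            continue
        if term not in umls:         # Matcher: direct match in UMLS-Str
            continue
        if len(term.split()) < 2:    # Filter: exclude single words
            continue
        if end_word(term) not in end_words:  # Matcher: top endWords
            continue
        candidates.append(term)
    return candidates
```

For example, with end_words = {"syndrome"}, an n-gram such as "rett syndrome" passes only if it is absent from the Lexicon and present among the UMLS strings.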
Pre-Process: 06.NGramUtil: Steps 20-21 (core.lc)
Pre-Process: Matcher EndWord: Step 1 (10.MatcherEndWord)
- flds 1 EndWord.1.analysis.stats > EndWord.1.analysis.stats.1
- Get the top N end words (endWords.top.data.${YEAR})
Process: Matcher CUI: Steps 30-33 (09.MatcherCui)
Process: Matcher CUI: Step 34 (09.MatcherCui)
Use Steps 35/36 to reorder the candidate list so that singulars and plurals are grouped together
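
The regrouping of singulars and plurals can be illustrated with a short sketch. How Steps 35/36 actually pair the forms is not described here, so the suffix-stripping key below is a deliberately naive assumption (a real implementation would more likely use Lexicon inflection data), and the names singular_key and group_singular_plural are hypothetical.

```python
def singular_key(term):
    """Naive grouping key: strip a plural suffix from the last word only.

    Assumption: simple English suffix rules; irregular plurals are not handled.
    """
    head, _, last = term.rpartition(" ")
    if last.endswith("ies") and len(last) > 3:
        last = last[:-3] + "y"
    elif last.endswith("s") and not last.endswith("ss"):
        last = last[:-1]
    return f"{head} {last}".strip()


def group_singular_plural(candidates):
    """Reorder candidates so terms sharing a key are adjacent.

    Groups keep the order of first appearance; within a group the
    shorter form (usually the singular) comes first.
    """
    groups = {}
    for term in candidates:
        groups.setdefault(singular_key(term), []).append(term)
    grouped = []
    for members in groups.values():
        grouped.extend(sorted(members, key=len))
    return grouped
```

Keeping the forms adjacent lets a reviewer tag the singular and the plural of a candidate in one pass.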

Generated files:
Dir: ${MULTIWORDS}/data/${YEAR}/outData/09.MatcherCui/Meta/Cand_List/36.disNGram.Core.endword.out.rmYesTagNo.gsp.${YEAR}.top{N}

Distilled MEDLINE nGram Set	Candidate Files	Status	Notes
2016	35.disNGram.Core.endword.out.gsp.2016	Done	Top 33 endWords
2017	36.disNGram.Core.endword.out.rmYesTagNo.gsp.2017	Done	Top 43 endWords
2018	36.disNGram.Core.endword.out.rmYesTagNo.gsp.2018	Done	Top 51 endWords
2019	36.disNGram.Core.endword.out.rmYesTagNo.gsp.2019	Done	Top 57 endWords
2020	36.disNGram.Core.endword.out.rmYesTagNo.gsp.2020	Done	Top 80 endWords
2021	36.disNGram.Core.endword.out.rmYesTagNo.gsp.2021	Processing Starting on 04.25	Top 85 endWords

TBD: In the future, use all high-frequency n-grams without the endWord Matcher.
=> That is, use 33.disNGram.Core.multiword.out. However, this does not seem to have high precision. It might be used with the SpVars model or a deep learning model.