Multiword Candidate Generation Processes:
UMLS_CUI with top endWords in the Distilled Medline N-gram Set
- N-grams from the latest Distilled Medline N-gram Set that are:
- core-term, lowercase
- Filter: exclude terms that are already in the Lexicon
- Matcher: pass terms that directly match a UMLS string (UMLS-Str, field 15), because such strings have a CUI
- Filter: exclude single words
- Matchers: pass terms that match the top (33+) endWords from Lexicon
- top endWords are: syndrome, protein, disease, proteins, cell, etc.
- Pre-Process: 06.NGramUtil: Steps 20-21 (core.lc)
- Pre-Process: Matcher EndWord: Step 1 (10.MatcherEndWord)
flds 1 EndWord.1.analysis.stats > EndWord.1.analysis.stats.1
- Get the top N end words (endWords.top.data.${YEAR})
- Process: Matcher CUI: Steps 30-33 (09.MatcherCui)
- Process: Matcher CUI: Step 34 (09.MatcherCui)
- Use Step 35/36 to reorder the candidate list so that singulars and plurals are grouped together
- Generated files:
Dir: ${MULTIWORDS}/data/${YEAR}/outData/09.MatcherCui/Meta/Cand_List/36.disNGram.Core.endword.out.rmYesTagNo.gsp.${YEAR}.top{N}
- TBD: In the future, use all high-frequency n-grams without the endWord Matcher.
=> That is, use 33.disNGram.Core.multiword.out. However, this does not seem to have high precision. Maybe use it with the SpVars model or a deep learning model.
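
The filter and matcher chain in this section can be illustrated with a minimal Python sketch. It is not the project's actual implementation (the real steps live in 06.NGramUtil, 09.MatcherCui, and 10.MatcherEndWord). All function names here are hypothetical, the input n-grams are assumed to be core-term normalized already, the end-word ranking is assumed to count the last word of multiword Lexicon terms, and the singular/plural grouping uses a naive trailing "s" rule purely for illustration.

```python
from collections import Counter


def top_end_words(lexicon_terms, n):
    """Return the n most frequent last words of multiword Lexicon terms.

    Assumption: the ranking is a simple frequency count of end words.
    """
    counts = Counter(
        term.lower().split()[-1]
        for term in lexicon_terms
        if len(term.split()) > 1
    )
    return {word for word, _ in counts.most_common(n)}


def generate_candidates(ngrams, lexicon, umls_strings, end_words):
    """Apply the filters and matchers in the order listed in this section."""
    lexicon_lc = {t.lower() for t in lexicon}
    umls_lc = {s.lower() for s in umls_strings}
    candidates = []
    for ngram in ngrams:
        term = ngram.lower()
        words = term.split()
        if term in lexicon_lc:          # Filter: already in the Lexicon
            continue
        if term not in umls_lc:         # Matcher: direct UMLS string match
            continue
        if len(words) < 2:              # Filter: single words
            continue
        if words[-1] not in end_words:  # Matcher: top endWords
            continue
        candidates.append(term)
    return candidates


def group_singular_plural(candidates):
    """Sort so a singular sits next to its naive '-s' plural (hypothetical rule)."""
    def key(term):
        base = term[:-1] if term.endswith("s") else term
        return (base, term)
    return sorted(candidates, key=key)
```

A caller would compute the end-word set once per release and then pass each n-gram list through `generate_candidates` followed by `group_singular_plural`. Real plural handling would need proper inflection rules rather than the trailing "s" shortcut used here.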