Ensemble Algorithm
The high-level algorithm of the ensemble method for spelling correction is described as follows.
I. Source code:
LinearWeightedEnsembleSpellCorrection.java
II. Algorithm
text: read in text of the whole question
List<Span> processSpans: remove header, such as SUBJECT:, EMAIL:, etc.
fixed: preprocessed text, which handles contractions, informational expressions, punctuation, split digits, etc.
List<CoreMap> sentences: use CoreNLP for annotation, treat the whole text as 1 sentence
List<CoreLabel> tokenAnns: tokens separated by spaces and punctuation (CoreNLP)
ProcessTokens to get:
List<String> origTokens: tokens separated by spaces and periods (end of sentence) only.
List<String> modTokens: Tag [MUM] and others
List<Integer> begins: the beginning position of modToken in the origTokens list
List<Integer> positions: the index of modToken in the origTokens list
List<Integer> origPositions: the beginning position of origToken in the origTokens list
correct to get corrected text:
LinkedHashSet<String> suggestions: single word suggestions
Map<String,String> mergeSuggestions: merge suggestions, key: merge suggestion, value: before merge tokens
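The header-removal step in the list above could look roughly like the following. This is only a minimal sketch: the class name HeaderStripSketch, the regular expression, and the decision to drop just the label (rather than the whole header line) are assumptions made for illustration, and the actual code works on List<Span> rather than on a plain string.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only; not taken from LinearWeightedEnsembleSpellCorrection.java. */
final class HeaderStripSketch {

    // Assumption: a header is one of these labels followed by a colon at the
    // start of a line. The real list of header labels is not given in the source.
    private static final Pattern HEADER =
            Pattern.compile("(?m)^(SUBJECT|EMAIL):[ \\t]*");

    /** Returns the text with the assumed header labels removed. */
    static String stripHeaders(String text) {
        return HEADER.matcher(text).replaceAll("");
    }
}
```

Labels that appear in the middle of a line are left alone, because the pattern is anchored to the start of each line.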
Where the scores are computed by the following methods:
| Score | Source Code | Notes |
|---|---|---|
| edScore | DictionaryBasedSpellChecker.getEditSimScore( ) | |
| phoneticScore | DictionaryBasedSpellChecker.getPhoneticSimScore( ) | |
| overlapScore | OverLapUtil.leadTrailOverlap( ) | |
| corpusScore | CorpusFrequencyCounts.getUnigramScore( ) | |
| w2vScore | Word2Vector.getSimScore( ) | |
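The class name LinearWeightedEnsembleSpellCorrection suggests that the five scores in the table are combined as a weighted sum. The sketch below illustrates that idea under stated assumptions: the class EnsembleScoreSketch, its method names, and any weight values passed in are hypothetical, and the score arrays stand in for the outputs of the methods listed in the table, which are not called here.

```java
import java.util.Map;

/** Illustrative sketch only; the real weights and combination logic are not given in the source. */
final class EnsembleScoreSketch {

    /**
     * Linear weighted combination of component scores.
     * Assumed order: edScore, phoneticScore, overlapScore, corpusScore, w2vScore.
     */
    static double ensembleScore(double[] scores, double[] weights) {
        if (scores.length != weights.length) {
            throw new IllegalArgumentException("scores and weights must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            sum += weights[i] * scores[i];
        }
        return sum;
    }

    /** Returns the candidate with the highest ensemble score, or null if there are no candidates. */
    static String best(Map<String, double[]> candidates, double[] weights) {
        String bestWord = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : candidates.entrySet()) {
            double s = ensembleScore(e.getValue(), weights);
            if (s > bestScore) {
                bestScore = s;
                bestWord = e.getKey();
            }
        }
        return bestWord;
    }
}
```

With equal weights the ranking reduces to comparing the mean of the component scores; unequal weights let one signal, such as corpus frequency, dominate the others.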