Dictionary Functions - Check Proper Noun
I. Introduction
Proper nouns should be checked separately during spelling-error detection to improve performance. Proper nouns can appear in several case patterns, as shown in the table below.
| Case pattern | Examples |
|---|---|
| Capitalized | Aachen, Beyer, Colgate |
| Mixed case | zur Hausen, ABC Medical Center, al-Tawil |
| Lower case | amicon, coll, dang |
| Upper case | BCDE, BSMMU, CINAHL |
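The four case patterns can be told apart with a few string checks. The following is a minimal sketch for illustration only; `case_pattern` is a hypothetical helper and is not taken from the tool described here.

```python
def case_pattern(term: str) -> str:
    """Classify a term into one of the four case patterns listed above.

    Hypothetical helper for illustration; not part of the original tool.
    """
    if not term:
        raise ValueError("term must not be empty")
    if term.islower():
        return "lower case"
    if term.isupper():
        return "upper case"
    # First character upper case, everything after it lower case.
    if term[0].isupper() and term[1:].islower():
        return "capitalized"
    return "mixed case"


for example in ["Aachen", "zur Hausen", "al-Tawil", "amicon", "BSMMU"]:
    print(example, "->", case_pattern(example))
```

Multi-word entries such as "ABC Medical Center" fall into the mixed-case group because the characters after the first are not all lower case.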
II. Approaches
Three approaches are compared:
- By Algorithm:
  - As implemented in the baseline, proper nouns are detected by algorithm.
- By Data - case sensitive:
  - Use proper nouns from the Lexicon
  - Use a case-sensitive dictionary
- By Data - case insensitive:
  - Use proper nouns from the Lexicon
  - Use a case-insensitive dictionary
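The two data approaches differ only in how a token is compared with the proper-noun list. The sketch below assumes a small in-memory set built from proper nouns mentioned in this document. The set, the function names, and the use of lowercasing as the normalization are all assumptions for illustration; the actual Lexicon-derived dictionary is not shown here.

```python
# Hypothetical proper-noun list (a stand-in for the Lexicon-derived data).
PROPER_NOUNS = {"Aachen", "Prego", "Thier", "Veracruz", "Gujarat"}

# Index for the case-insensitive approach, built once by lowercasing each entry.
PROPER_NOUNS_LOWER = {word.lower() for word in PROPER_NOUNS}


def is_proper_noun_case_sensitive(token: str) -> bool:
    """Match only when the token has exactly the same case as the entry."""
    return token in PROPER_NOUNS


def is_proper_noun_case_insensitive(token: str) -> bool:
    """Match regardless of how the user capitalized the token."""
    return token.lower() in PROPER_NOUNS_LOWER


for token in ["Veracruz", "veracruz", "VERACRUZ"]:
    print(token,
          is_proper_noun_case_sensitive(token),
          is_proper_noun_case_insensitive(token))
```

With the case-sensitive lookup, a lowercase token such as "veracruz" is not recognized as a proper noun and remains a candidate spelling error. With the case-insensitive lookup, it is accepted as valid.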
III. Results
Test results with Single-Word, English-Word as the dictionary:

| Approach | TP | Ret | Rel | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| Algorithm | 521 | 710 | 814 | 0.7338 | 0.6400 | 0.6837 |
| Data-Case | 537 | 755 | 814 | 0.7113 | 0.6597 | 0.6845 |
| Data-No Case | 537 | 751 | 814 | 0.7150 | 0.6597 | 0.6863 |
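The precision, recall, and F1 columns follow from the TP, Ret (retrieved), and Rel (relevant) counts. As a minimal sketch, assuming the standard definitions (precision = TP/Ret, recall = TP/Rel, F1 = harmonic mean of the two), the figures can be recomputed as follows; the function name `prf` is illustrative.

```python
def prf(tp: int, retrieved: int, relevant: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / retrieved
    recall = tp / relevant
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (TP, Ret, Rel) counts taken from the results table.
RESULTS = {
    "Algorithm": (521, 710, 814),
    "Data-Case": (537, 755, 814),
    "Data-No Case": (537, 751, 814),
}

for name, counts in RESULTS.items():
    p, r, f = prf(*counts)
    print(f"{name}: precision={p:.4f} recall={r:.4f} F1={f:.4f}")
```

For the Algorithm row this gives precision 0.7338, recall 0.6400, and F1 0.6837.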
- With the data approaches, recall and F1 increase and precision decreases, compared with the algorithm approach.
- The [TP] count is the same for the two data approaches. The difference in retrieved cases (755 vs. 751) consists of 4 [FP] that appear only with the case-sensitive approach:
  - 14276 prego preg => Prego, Data-No Case is not right
  - 16167 thier ther => Thier, Data-No Case is not right
  - 17055 veracruz vera cruz => Veracruz, Data-No Case is good
  - 17991 gujarat gujar at => Gujarat, Data-No Case is good

=> The case-sensitive approach is right in only about 50% of these cases (2 of 4). That is well below the overall precision of about 71%, so it results in worse precision and F1 than the case-insensitive approach. Thus, the case-insensitive data approach is implemented. One of the main reasons for using case-insensitive matching is that users (consumers) might type proper nouns in lower case, upper case, or mixed case, so the chance of a case match is about 50/50.
- Using the case-sensitive data approach could increase recall (by finding more spelling errors), but it would rely on the ranking algorithm to find the correct word in order to improve precision.