Ending Punctuation Splitter
This splitter splits a token by adding a space after ending punctuation when the token contains any. The ending punctuation characters are: . ? ! , : ; & ) ] }
Examples of tokens split after ending punctuation:
| File Name | Input | Output |
|---|---|---|
| 10023.txt | down.please | down. please |
| 10286.txt | ...my | ... my |
| 10004.txt | cancer?if | cancer? if |
| 11186.txt | ?pls | ? please |
| 97.txt | suggestions?thanks | suggestions? thanks |
| 53.txt | hello!can | hello! can |
| 11186.txt | ,she | , she |
| 16823.txt | :by | : by |
| 22.txt | ;syrinx | ; syrinx |
| 2.txt | )why | ) why |
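
As a rough illustration of the input/output pairs above, the sketch below inserts a space after each run of ending punctuation that is directly followed by other text. The class name `EndingPunctuationSketch`, the method name, and the regular expression are assumptions made for this example, not the splitter's actual code, and the sketch applies no matchers or exception filters.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the splitter's real class.
class EndingPunctuationSketch {
    // A run of ending punctuation (. ? ! , : ; & ) ] }) that is directly
    // followed by a character that is neither whitespace nor more ending
    // punctuation. This pattern is an assumption, not taken from the source.
    private static final Pattern RUN_BEFORE_TEXT =
        Pattern.compile("([.?!,:;&)\\]}]+)(?=[^\\s.?!,:;&)\\]}])");

    // Insert one space after each such run; other tokens are returned unchanged.
    static String addSpaceAfterEndingPunctuation(String token) {
        return RUN_BEFORE_TEXT.matcher(token).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        String[] tokens = {"down.please", "...my", "cancer?if", ")why"};
        for (String token : tokens) {
            System.out.println(token + " -> " + addSpaceAfterEndingPunctuation(token));
        }
    }
}
```

Under these assumptions a token such as `3.5` would also be split into `3. 5`, which is the kind of case that exception filtering is meant to prevent.
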

Broader Generic Matchers
| Matcher | Regular Expression | Examples |
|---|---|---|
| Contains Ending Punctuation | ^.*[\\.\\?!,;:&\\)\\]\\}].*$ | |
| Email (false) | ^[\\w!#$%&'*+-/=?^_`{|}~]+@(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net)))$ | |
| Url (false) | ^((ftp|http|https|file)://)?(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net|uk)).*)$ | |
| Pure digit or punctuation (false) | ^([\\W_\\d&&\\S]+)$ | |
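
The matcher table can be read as a qualification check. The sketch below compiles the four regular expressions exactly as listed and assumes that "(false)" means the token must not match that pattern for a split to be considered. That reading, the use of a full-token match, and the names `GenericMatcherSketch` and `isSplitCandidate` are assumptions for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the splitter's real class.
class GenericMatcherSketch {
    // Regular expressions copied from the matcher table.
    private static final Pattern CONTAINS_ENDING_PUNCTUATION =
        Pattern.compile("^.*[\\.\\?!,;:&\\)\\]\\}].*$");
    private static final Pattern EMAIL = Pattern.compile(
        "^[\\w!#$%&'*+-/=?^_`{|}~]+@(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net)))$");
    private static final Pattern URL = Pattern.compile(
        "^((ftp|http|https|file)://)?(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net|uk)).*)$");
    private static final Pattern PURE_DIGIT_OR_PUNCTUATION =
        Pattern.compile("^([\\W_\\d&&\\S]+)$");

    // Assumed reading: the first matcher must be true and the three
    // "(false)" matchers must all be false.
    static boolean isSplitCandidate(String token) {
        return CONTAINS_ENDING_PUNCTUATION.matcher(token).matches()
            && !EMAIL.matcher(token).matches()
            && !URL.matcher(token).matches()
            && !PURE_DIGIT_OR_PUNCTUATION.matcher(token).matches();
    }
}
```

With this reading, `isSplitCandidate("down.please")` is true, while `user@example.com`, `www.example.org`, and `12.34` are rejected by the email, URL, and pure digit or punctuation matchers respectively.
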

Filters (Specific Exceptions for Each Ending Punctuation)
| Ending Punctuation | Filter (Exception) | Regular Expression | Examples |
|---|---|---|---|
| Period [.] | 1. Plural form | (.*\\.s) | |
| | 2. Surrounded by digits [char]*[digit].[digit][char]* | ((\\w*\\d\\.\\d\\w*)+) | |
| | 3. Surrounded by single characters [single non-digit].[single non-digit]? | ((\\D\\.)+\\D?) | |
| | 4. Followed by a hyphen [word]*.-[word]* | (\\w*\\.-\\w*) | |
| | 5. Followed by a quote [char]*.['"] | (.*\\.['\"]) | |
| Question Mark [?] | 1. Followed by a quote [char]*?['"] | (.*\\?['\"]) | |
| Exclamation Mark [!] | 1. Followed by a quote [char]*!['"] | (.*!['\"]) | |
| Comma [,] | 1. Digit group separator [digit]+,[digit]{3} | (\\d+(,[\\d]{3})+) | |
| Colon [:] | 1. Ratio [digit]+:[digit]+ | (\\d+:\\d+) | |
| Semicolon [;] | 1. No exceptions found | $^ | None |
| Ampersand [&] | 1. Abbreviations [A-Z]+&[A-Z]+ | [A-Z]+&[A-Z]+ | |
| Right Parenthesis [)] | 1. Single char surrounded by parentheses [non-space]*([+char])[non-space]* | ((\\S)*\\([+\\w]\\)(\\S)*) | |
| | 2. Chars surrounded by parentheses and followed by a hyphen [non-space]*(char+)-[non-space]* | ((\\S)*\\([+\\w]+\\)-(\\S)*) | |
| | 3. Digits surrounded by parentheses [non-space]*(digit+)[non-space]* | ((\\S)*\\(\\d+\\)(\\S)*) | |
| Right Square Bracket []] | 1. [digit]+[Upper] surrounded by [] [non-space]*[[digit]+[Upper]][non-space]* | (\\S*\\[\\d+[A-Z]\\]\\S*) | |
| | 2. [lower] surrounded by [] [Upper]+ | (\\S*\\[[a-z]\\]\\S*) | |
| Right Curly Brace [}] | 1. No exceptions found | $^ | None |
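
One way to picture the filter table is as a lookup from each ending punctuation character to its exception patterns. The sketch below registers the regular expressions as listed and treats a token as an exception when it fully matches any filter for a punctuation character it contains. The full-token match, the lookup structure, and the names `EndingPunctuationFilterSketch`, `register`, and `isException` are assumptions for illustration, not the splitter's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the splitter's real class.
class EndingPunctuationFilterSketch {
    private static final Map<Character, List<Pattern>> FILTERS = new LinkedHashMap<>();

    // Compile and store the exception patterns for one punctuation character.
    private static void register(char punctuation, String... regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        FILTERS.put(punctuation, patterns);
    }

    static {
        // Regular expressions copied from the filter table.
        register('.', "(.*\\.s)", "((\\w*\\d\\.\\d\\w*)+)", "((\\D\\.)+\\D?)",
                 "(\\w*\\.-\\w*)", "(.*\\.['\"])");
        register('?', "(.*\\?['\"])");
        register('!', "(.*!['\"])");
        register(',', "(\\d+(,[\\d]{3})+)");
        register(':', "(\\d+:\\d+)");
        register(';', "$^"); // listed as "No exceptions found"
        register('&', "[A-Z]+&[A-Z]+");
        register(')', "((\\S)*\\([+\\w]\\)(\\S)*)", "((\\S)*\\([+\\w]+\\)-(\\S)*)",
                 "((\\S)*\\(\\d+\\)(\\S)*)");
        register(']', "(\\S*\\[\\d+[A-Z]\\]\\S*)", "(\\S*\\[[a-z]\\]\\S*)");
        register('}', "$^"); // listed as "No exceptions found"
    }

    // Assumed rule: a token is an exception (left unsplit) when it contains a
    // punctuation character and fully matches one of that character's filters.
    static boolean isException(String token) {
        for (Map.Entry<Character, List<Pattern>> entry : FILTERS.entrySet()) {
            char punctuation = entry.getKey();
            if (token.indexOf(punctuation) < 0) {
                continue;
            }
            for (Pattern pattern : entry.getValue()) {
                if (pattern.matcher(token).matches()) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Under these assumptions, tokens such as `3.5`, `U.S.A`, `1,000`, `3:45`, `AT&T`, and `f(x)` are reported as exceptions, while `down.please` and `;syrinx` are not.
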