Core-term
I. Introduction
Many n-grams have punctuation at the beginning and/or the end, such as:
| Input Term | Core-term |
|---|---|
| - in details, | in details |
| - in details | in details |
| in details, | in details |
| in (5) details, | in (5) details |
| (in (5) details, | in (5) details |
| (in (5) details), | in (5) details |
All of the n-grams above are normalized to "in details" or "in (5) details" by stripping the leading and/or trailing punctuation. The normalized term is called the core-term, that is, the core of the term. This process is called core-term normalization.
A core-term may retain internal punctuation, such as "in (5) details". Leading and/or trailing punctuation may also remain in a core-term, such as "clean room(s)".
II. Algorithm
Repeat the following process until the term no longer changes or its length is 0:
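* Strip leading punctuation (left brackets are not in this list):
| Character set | Characters |
|---|---|
| ASCII | -)}]_!@#%&*\\:;\"',.?/~+=\|>$`^ |
| Unicode | ¦§»‐‑‒–—―’”•․‥…⁈ |
* Strip trailing punctuation (right brackets are not in this list):
| Character set | Characters |
|---|---|
| ASCII | -({[_!@#%&*\\:;\"',.?/~+=\|>$`^ |
| Unicode | ¦§«‐‑‒–—―‘“•․‥…⁈ |
* Strip brackets:
| Character set | Bracket pairs |
|---|---|
| ASCII | (), [], {}, <> |
| Unicode | «»‘’“” |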
* net bracket no = total left bracket no - total right bracket no
For example,
| Term | Net Bracket No |
|---|---|
| (in details:) | 0 |
| (in (5) details:) | 0 |
| (in (5) details | 1 |
| in (5) details) | -1 |
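The net bracket number can be computed by simple counting. The following is a minimal Python sketch, not the original implementation: the name `net_bracket_no` is hypothetical, and it assumes that every left and right character listed among the bracket pairs in this section counts as a bracket.

```python
# Assumption: these are the bracket characters from the bracket-pair rows
# in this section (ASCII pairs first, then the Unicode pairs).
LEFT_BRACKETS = "([{<«‘“"
RIGHT_BRACKETS = ")]}>»’”"


def net_bracket_no(term: str) -> int:
    """Return total left bracket count minus total right bracket count."""
    left = sum(term.count(ch) for ch in LEFT_BRACKETS)
    right = sum(term.count(ch) for ch in RIGHT_BRACKETS)
    return left - right
```

A positive result means the term has surplus left brackets, a negative result means surplus right brackets, and 0 means the counts are equal. Equal counts do not guarantee that the brackets are properly nested.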
III. Examples
| Input nGram | Core-term |
|---|---|
| Strip punctuation | |
| -in details | in details |
| In details: | In details |
| #$%IN DETAILS:%^( | IN DETAILS |
| ( | |
| () | |
| Strip brackets | |
| {in (5) details} | in (5) details |
| {{in (5) details} | in (5) details |
| {in (5) details}} | in (5) details |
| {in (5)} details}} | {in (5)} details |
| Strip brackets and punctuation | |
| (in details:) | in details |
| (in details:)) | in details |
| (-(in details)%^) | in details |
| {in (5) days}, | in (5) days |
| in (5 days), | in (5 days) |
| in ((5) days), | in ((5) days) |
| ((clean room(s))) | clean room(s) |
| ((inch(es))) | inch(es) |
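The rows above can be reproduced with a short Python sketch. This is one possible reading of section II, not the original implementation. The names `core_term` and `net_bracket_no` are hypothetical, and the sketch assumes the following: the first character list in section II applies to the leading character and the second to the trailing character; a surplus left bracket is stripped from the start when the net bracket number is positive, and a surplus right bracket from the end when it is negative; a matching pair is stripped only when it sits at both ends of the term; and surrounding whitespace is trimmed after each step.

```python
# Character sets copied from the tables in section II.
LEADING_PUNCT = set("-)}]_!@#%&*\\:;\"',.?/~+=|>$`^" "¦§»‐‑‒–—―’”•․‥…⁈")
TRAILING_PUNCT = set("-({[_!@#%&*\\:;\"',.?/~+=|>$`^" "¦§«‐‑‒–—―‘“•․‥…⁈")
BRACKET_PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">",
                 "«": "»", "‘": "’", "“": "”"}
RIGHT_BRACKETS = set(BRACKET_PAIRS.values())


def net_bracket_no(term: str) -> int:
    """Total left bracket count minus total right bracket count."""
    left = sum(ch in BRACKET_PAIRS for ch in term)
    right = sum(ch in RIGHT_BRACKETS for ch in term)
    return left - right


def core_term(term: str) -> str:
    """Strip one character (or one bracket pair) per pass until stable."""
    term = term.strip()
    while term:
        net = net_bracket_no(term)
        if term[0] in LEADING_PUNCT:
            term = term[1:]              # leading punctuation
        elif term[-1] in TRAILING_PUNCT:
            term = term[:-1]             # trailing punctuation
        elif net > 0 and term[0] in BRACKET_PAIRS:
            term = term[1:]              # surplus left bracket (assumed rule)
        elif net < 0 and term[-1] in RIGHT_BRACKETS:
            term = term[:-1]             # surplus right bracket (assumed rule)
        elif len(term) >= 2 and BRACKET_PAIRS.get(term[0]) == term[-1]:
            term = term[1:-1]            # matching pair at both ends
        else:
            break                        # nothing left to strip
        term = term.strip()              # assumption: whitespace is trimmed
    return term
```

For example, `core_term("(-(in details)%^)")` returns `"in details"` and `core_term("((clean room(s)))")` returns `"clean room(s)"`. The order of the branches matters in this sketch: balancing surplus brackets before removing an enclosing pair is what keeps `{in (5)} details}}` from losing its leading `{`.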