Sub-Term Mapping Tools

Sub-Term APIs: Find Sub-Terms Design

I. Introduction
This section describes the sub-term related methods used to find:

  • All sub-terms of a term in a corpus
  • The longest prefix sub-term of a term in a corpus

II. Algorithm

  • Init Vector<String> matchTerms
  • Get inWords by tokenizing newInTerm (the normalized inTerm)
  • For each startIndex in inWords
    • Get curTerm: the words of inWords from startIndex to the end
    • Find branchMatches for curTerm
      • Append " $_END" (the END node) to curTerm
      • Tokenize the normalized term into inWords as a Vector<String>
      • Set curNode to the ROOT node
      • Init Vector<String> branchMatches
      • For each curWord in inWords
        • Initialize curWordNode from curWord
        • Get curChilds from curNode
        • Check whether curChilds has an END node
          • yes => add the branch term (the words matched so far) to branchMatches
        • Check whether curChilds contains curWordNode
          • yes => update curNode to that child
          • no => no match (false), break
    • Add branchMatches to matchTerms
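
The steps above can be sketched in Java as follows. This is a minimal illustration, not the SubtermApi implementation: the class SubTermTrieSketch, its Node type, and the methods addTerm, tokenize, findBranchMatches, and findSubTerms are hypothetical names, and normalization is assumed to be lowercasing plus whitespace tokenization. Matches are reported as term|start index|end index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the trie walk; all names here are illustrative.
class SubTermTrieSketch {
    static final String END = "$_END"; // END marker, as in the algorithm

    // One trie node; children are keyed by word.
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
    }

    final Node root = new Node();

    // Assumed normalization: lowercase, then split on whitespace.
    static List<String> tokenize(String term) {
        return Arrays.asList(term.trim().toLowerCase().split("\\s+"));
    }

    // Add a corpus term: one node per word, then an END child.
    void addTerm(String term) {
        Node cur = root;
        for (String word : tokenize(term)) {
            cur = cur.children.computeIfAbsent(word, w -> new Node());
        }
        cur.children.put(END, new Node());
    }

    // Find every corpus term that starts at startIndex of inWords.
    List<String> findBranchMatches(List<String> inWords, int startIndex) {
        List<String> words = new ArrayList<>(inWords.subList(startIndex, inWords.size()));
        words.add(END); // sentinel so the END check also runs after the last word
        List<String> branchMatches = new ArrayList<>();
        Node curNode = root;
        for (int i = 0; i < words.size(); i++) {
            // The words consumed so far form a complete corpus term.
            if (curNode.children.containsKey(END)) {
                String term = String.join(" ", words.subList(0, i));
                branchMatches.add(term + "|" + startIndex + "|" + (startIndex + i));
            }
            Node next = curNode.children.get(words.get(i));
            if (next == null) {
                break; // no match: the branch ends here
            }
            curNode = next;
        }
        return branchMatches;
    }

    // Collect the branch matches for every start index of the input term.
    List<String> findSubTerms(String inTerm) {
        List<String> inWords = tokenize(inTerm);
        List<String> matchTerms = new ArrayList<>();
        for (int start = 0; start < inWords.size(); start++) {
            matchTerms.addAll(findBranchMatches(inWords, start));
        }
        return matchTerms;
    }

    public static void main(String[] args) {
        SubTermTrieSketch trie = new SubTermTrieSketch();
        for (String t : new String[] {"dog", "canine", "cat", "feline", "k9",
                "bull dog", "dog and cat", "pets", "puppy and kitty"}) {
            trie.addTerm(t);
        }
        // prints [dog|2|3, dog and cat|2|5, cat|4|5]
        System.out.println(trie.findSubTerms("Who let dog and cat out"));
    }
}
```

Appending the END marker as a sentinel word keeps the loop body uniform: a term that ends on the last input word is detected on the extra iteration, without a separate check after the loop.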

III. Java Classes & Methods

  • SubtermApi.java: a Java class for sub-term methods
    • public static Vector FindSubtermStrs(String inTerm, Corpus corpus)
    • public static Vector FindSubterms(String inTerm, Corpus corpus)
    • public static Vector FindAllPrefixSubterms(String inTerm, Corpus corpus)
    • public static String FindLongestPrefixSubterm(String inTerm, Corpus corpus)
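
To illustrate what the two prefix methods are expected to compute, here is a small sketch. It is not the SubtermApi code: the class PrefixSubTermSketch and its methods are hypothetical, a plain Set<String> stands in for the Corpus class (whose interface is not shown here), and returning null when nothing matches is an assumption. It also swaps the trie walk for a simple set lookup on each word-level prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; a Set<String> stands in for the Corpus class.
class PrefixSubTermSketch {
    // All corpus terms that are word-level prefixes of inTerm, shortest first.
    static List<String> allPrefixSubTerms(String inTerm, Set<String> corpusTerms) {
        String[] words = inTerm.trim().toLowerCase().split("\\s+");
        List<String> matches = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (String word : words) {
            if (prefix.length() > 0) {
                prefix.append(' ');
            }
            prefix.append(word);
            if (corpusTerms.contains(prefix.toString())) {
                matches.add(prefix.toString());
            }
        }
        return matches;
    }

    // Longest prefix sub-term, or null (an assumption) when no prefix matches.
    static String longestPrefixSubTerm(String inTerm, Set<String> corpusTerms) {
        List<String> all = allPrefixSubTerms(inTerm, corpusTerms);
        return all.isEmpty() ? null : all.get(all.size() - 1);
    }

    public static void main(String[] args) {
        Set<String> corpus = new HashSet<>(Arrays.asList("dog", "dog and cat", "cat", "pets"));
        System.out.println(allPrefixSubTerms("dog and cat out", corpus));    // [dog, dog and cat]
        System.out.println(longestPrefixSubTerm("dog and cat out", corpus)); // dog and cat
        System.out.println(longestPrefixSubTerm("who let dog out", corpus)); // null
    }
}
```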

IV. Examples

  • Synonym Rules:

    word            | synonym
    dog             | canine
    cat             | feline
    canine          | K9
    K9              | bull dog
    Dog and cat     | pets
    puppy and kitty | pets

  • Synonym Terms:

    Terms
    dog
    canine
    cat
    feline
    k9
    bull dog
    dog and cat
    pets
    puppy and kitty

  • Input Term:
    Who let dog and cat out
    • Normalize: who let dog and cat out
    • Go through terms from "who let dog and cat out"
      i | curTerm                 | branchMatches    | matchTerms
      0 | who let dog and cat out |                  |
      1 | let dog and cat out     |                  |
      2 | dog and cat out         | dog, dog and cat | dog, dog and cat
      3 | and cat out             |                  | dog, dog and cat
      4 | cat out                 | cat              | dog, dog and cat, cat
      5 | out                     |                  | dog, dog and cat, cat

  • Output:

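    Returns each matched term as term | start index | end index, where the indexes are word positions in the input term and the end index is the position just after the last matched word: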

    • dog|2|3
    • dog and cat|2|5
    • cat|4|5

  • Trie Tree