To build a SolrTextTagger index, or an Aho-Corasick string matching FSM in general, the basic idea is to populate the FSM with the dictionary entries. Internally, this creates a Trie structure connecting the words of the dictionary entries to one another. At runtime, you stream text against this structure, and it matches phrases in the text that match the dictionary entries. The one major drawback of this approach is that it will only match phrases in the text that are in the same word order as the equivalent phrase in the dictionary. For example, if my dictionary entry is "tuberculosis pulmonary", it will not match "pulmonary tuberculosis" in the text. Depending on your application, this may or may not be important. But I recently discovered a way to do it, thanks to the ideas presented in the paper Efficient Clinical Concept Extraction in Electronic Medical Records (Guo, Kakrania, Baldwin, and Syeda-Mahmood, IBM Research Almaden, 2017). Hat tip to my manager Ron Daniel for pointing me to the paper.
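To make the order sensitivity concrete, here is a minimal sketch using the PyAhoCorasick library (introduced below); the phrase and payload are just illustrative:

```python
import ahocorasick

# Build an FSM keyed on the full dictionary phrase.
automaton = ahocorasick.Automaton()
automaton.add_word("tuberculosis pulmonary", "concept-1")  # hypothetical concept ID
automaton.make_automaton()

# Matches only occur in dictionary word order.
print(list(automaton.iter("tuberculosis pulmonary")))  # [(21, 'concept-1')]
print(list(automaton.iter("pulmonary tuberculosis")))  # []
```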
Specifically, the part that caught my eye was their example of being able to match the string "Chest CT" from the phrase "a subsequent CT scan of the chest", and their description of their approach in the paper. As you can see, there is not much detail here, so I am not sure if my approach is identical to theirs, but it got me thinking, resulting in the algorithm I describe in this post. In the spirit of reproducibility, the description is accompanied by my Jupyter notebook containing code to build and query an in-memory FSM using the PyAhoCorasick library. The notebook contains several examples of using the Aho-Corasick algorithm to match strings in the inexact manner described against an in-house medical dictionary, both for short queries (including the example cited in the paper) and for long paragraphs of text.
To speed up the processing, we propose a novel indexing method that significantly reduces the search space while still maintaining the requisite flexibility in matching. First, each word in the vocabulary is represented by a unique prefix, the shortest sequence of letters that differentiates it from every other term. Next, an inverted index is created to map prefixes to report sentences. Starting from the representative prefix of each term in the vocabulary (or a set of prefixes in the case of a multi-word term), all relevant sentences can be easily retrieved as potential matches for the term, and post-filtering by longest common word sequence matching can be used to further refine the search results.
The big idea here is as follows. Instead of populating the FSM with phrases from the dictionary, we populate it with the words that the phrases are built out of. When we populated it with phrases, the payload associated with each phrase was the concept ID. When we populate it with words, collisions are almost certain. For example, the word "tuberculosis" occurs 1,352 times in the dictionary. So the payload is now a list of concept ID and synonym ID pairs, much like the postings list in the inverted index structure used by search engines. The concept ID is self-explanatory; the synonym ID is a running serial number (for example, synonym ID 0 is the primary name for the concept). We filter out stopwords and non-alpha words in an attempt to reduce the noise. Matching is case insensitive (both dictionary and text are lowercased) unless the word is an obvious abbreviation given in all-caps.
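As a rough sketch of this normalization (my reading of the rules above, with a stand-in stopword list):

```python
STOPWORDS = {"a", "an", "of", "the", "and", "in", "for"}  # stand-in for a real stopword list

def normalize_words(text):
    """Lowercase (unless all-caps abbreviation), drop stopwords and non-alpha words."""
    words = []
    for word in text.split():
        if not word.isalpha():
            continue
        if not word.isupper():      # keep obvious abbreviations like "CT" as-is
            word = word.lower()
        if word.lower() in STOPWORDS:
            continue
        words.append(word)
    return words

print(normalize_words("a subsequent CT scan of the chest"))
# ['subsequent', 'CT', 'scan', 'chest']
```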
In addition to the concept synonym ID pair (CSID hereafter), we also store in the payload a floating point number representing the value of a match. The idea is that the string "tuberculosis" matched against the concept for "tuberculosis" is more valuable than a match against the concept for "pulmonary tuberculosis" or "pulmonary tuberculosis symptoms". The weight for each word in a synonym is the reciprocal of the number of words (minus stopwords) in that synonym. This will be used later to grade the quality of the matches. Thus the payload for each word is a list of (CSID, weight) pairs, one for each occurrence of the word in a synonym.
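A sketch of how the per-word payloads might be assembled; the concept IDs are taken from the example later in this post, and add_synonym is a hypothetical helper, not the code from my notebook:

```python
from collections import defaultdict

# word -> list of ((concept_id, synonym_id), weight) pairs
payloads = defaultdict(list)

def add_synonym(concept_id, synonym_id, synonym_text):
    words = normalize_words(synonym_text)   # from the sketch above
    if not words:
        return
    weight = 1.0 / len(words)               # reciprocal of non-stopword word count
    for word in words:
        payloads[word].append(((concept_id, synonym_id), weight))

add_synonym("8121505", 0, "helicobacter pylori")
add_synonym("8131809", 0, "helicobacter")
# payloads["helicobacter"] == [(("8121505", 0), 0.5), (("8131809", 0), 1.0)]
```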
Because the PyAhoCorasick API doesn't allow us to add to the payload incrementally, we preprocess the TSV file representing the dictionary to create a dictionary mapping each word to its list of (CSID, weight) pairs. We then use this dictionary to populate the Aho-Corasick FSM. Internally, the FSM states are just characters, so when querying this structure you will have to filter the output, retaining only the matches that correspond to full words in the query.
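With the per-word payload dictionary in hand, populating the FSM is straightforward; a sketch, reusing the payloads dictionary built above:

```python
import ahocorasick

automaton = ahocorasick.Automaton()
for word, csid_weights in payloads.items():
    # store the word itself so hits can be checked against full query words later
    automaton.add_word(word, (word, csid_weights))
automaton.make_automaton()
```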
When querying, you must apply the same set of transformations that you applied to the dictionary words, namely lowercasing unless the word is an obvious abbreviation, and removing stopwords and non-alpha words. As mentioned earlier, you must filter the results to retain only matches to full words. The next step is to post-process the results to extract a set of inexact matches.
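A sketch of the query side, using PyAhoCorasick's iter() and a word-boundary check to keep only full-word matches:

```python
def query_fsm(automaton, text):
    words = normalize_words(text)
    query = " ".join(words)
    matches = []
    for end_index, (word, csid_weights) in automaton.iter(query):
        start_index = end_index - len(word) + 1
        # keep only hits that line up with whole words in the query
        before = start_index == 0 or query[start_index - 1] == " "
        after = end_index == len(query) - 1 or query[end_index + 1] == " "
        if before and after:
            matches.append((start_index, word, csid_weights))
    return matches

matches = query_fsm(automaton, "helicobacter pylori patient perception")
```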
The first step is to populate a weight matrix W using the weights from the payloads of the filtered results. The rows of this matrix correspond to the words being matched, and the columns correspond to the CSID entries across all the words. Because words tend to clump around CSIDs, the matrix is expected to be fairly sparse. The quality of a match of the full query string against a particular CSID is computed as the sum of the weights down the corresponding column. You provide a confidence threshold which the algorithm uses to retain only matches whose column sum exceeds it. These are then moved to the candidate match matrix C. The figure below shows the offsets and weights returned for different CSIDs for the query "helicobacter pylori patient perception".
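A sketch of this scoring step using numpy; the threshold value here is an arbitrary illustration, not the one used in my notebook:

```python
import numpy as np

def score_matches(matches, threshold=0.3):
    """matches: list of (offset, word, [(csid, weight), ...]) from the FSM."""
    # one column per distinct CSID seen across all matched words
    csids = sorted({csid for _, _, cw in matches for csid, _ in cw})
    col = {csid: j for j, csid in enumerate(csids)}
    W = np.zeros((len(matches), len(csids)))  # one row per matched word
    for i, (_, _, csid_weights) in enumerate(matches):
        for csid, weight in csid_weights:
            W[i, col[csid]] = weight
    # match quality for a CSID is the sum of the weights down its column
    sums = W.sum(axis=0)
    return [(csid, sums[col[csid]]) for csid in csids if sums[col[csid]] >= threshold]
```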
Some interesting things to note here. Notice that both "helicobacter" and "pylori" have their own matches (8131809:0 and 8121505:4). In addition, both match CSID 8121505:0. The word "patient" matches two CSIDs exactly (811861:0 and 811861:1), because the word is repeated across the synonyms, one with a capital P and one without. However, there may be cases where the same word maps to multiple different concept IDs because it is genuinely ambiguous.
The next step is to merge offsets. This is where we take words that are consecutive or close to one another and consider them a single span. Closeness is defined by a user-specified span_slop, which is set to 5 by default. In our example above, the first two words can be merged for CSID 8121505:0. At this point, we have moved from single words annotated with multiple CSID pairs to some multi-word spans annotated as well.
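A sketch of the merge step over (start, end) character offsets for words matched to the same CSID:

```python
def merge_offsets(spans, span_slop=5):
    """Merge (start, end) offsets that are within span_slop of each other."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= span_slop:
            merged[-1] = (merged[-1][0], end)   # extend the previous span
        else:
            merged.append((start, end))
    return merged

# "helicobacter" (0..11) and "pylori" (13..18) merge for CSID 8121505:0
print(merge_offsets([(0, 11), (13, 18)]))  # [(0, 18)]
```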
We now remove the mappings that are wholly contained inside longer mappings. The idea here is that a string such as "helicobacter pylori" is more specific than either "helicobacter" or "pylori" alone, so the latter mappings should be removed in favor of the former. That is what the algorithm does next. We are now ready to pull in the primary name of the matched concept and present the final results, as shown below.
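And a sketch of the containment filter over the merged spans:

```python
def remove_contained(spans):
    """Drop spans wholly contained within some longer span in the list."""
    keep = []
    for s in spans:
        contained = any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)
        if not contained:
            keep.append(s)
    return keep

# "helicobacter" and "pylori" are subsumed by "helicobacter pylori"
print(remove_contained([(0, 11), (13, 18), (0, 18)]))  # [(0, 18)]
```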
And that is pretty much it. This technique is applicable not only to short strings such as the one I used for the example, but also to longer bodies of text. In my Notebook with the Code and Examples, I show the same algorithm being used to map entire paragraphs, although it may be better for quality reasons to annotate sentence by sentence.
At some point, I hope to fold this functionality into SoDA, my open source wrapper to SolrTextTagger mentioned earlier. But given my current queue, I don't foresee that happening anytime soon. If any of you would like to implement this idea in Scala and integrate it into SoDA, please send me a pull request, and I will be happy to merge it in and give you credit.