Natural Language Processing
This topic describes Fusion’s Natural Language Processing (NLP) features, available in the legacy OpenNLP NER Extraction index pipeline stage and NLP Annotator index and query pipeline stages.
OpenNLP NER Extraction pipeline stage
The OpenNLP NER Extraction index pipeline stage performs only Named Entity Recognition (NER). This stage is available in all versions of Fusion.
For additional NLP functionality, use the NLP Annotator pipeline stages, available in Fusion versions 4.2 and later. See below for details.
NLP Annotator pipeline stages
The NLP Annotator is both an index pipeline stage and a query pipeline stage. It performs a variety of fundamental NLP tasks, which are described under NLP features below.
If configured in an index pipeline, the NLP Annotator performs the selected NLP tasks on raw document content during indexing. If configured in a query pipeline, the NLP Annotator performs the selected NLP tasks on the query text.
NLP features
Fusion’s NLP Annotator pipeline stages include the NLP features described below.
Sentence detection
Sentence detection is the process of analyzing text to determine sentence boundaries. It is typically the first step in any kind of natural language processing on a document. Commonly, the detected sentences are indexed as a multi-valued field that can be used for various purposes, as in these examples:
- Relevancy: Boost documents whose first sentence matches the query terms.
- Snippets: When presenting the search results, display the first few sentences of each document.
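The following Python sketch illustrates the idea of indexing sentences as a multi-valued field and deriving a snippet from them. It is not Fusion's implementation: the regex boundary rule is a deliberately naive stand-in for a trained sentence detector, and the function names and field names (`sentences`, `first_sentence`, `snippet`) are hypothetical.

```python
import re

# Naive boundary rule (an assumption for illustration): a sentence ends at
# '.', '!' or '?' when followed by whitespace and an uppercase letter.
# Trained detectors handle abbreviations, quotes, and similar cases better.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Return the sentences found in text as a list of strings."""
    return [s.strip() for s in _BOUNDARY.split(text.strip()) if s.strip()]


def to_document(doc_id, body, snippet_size=2):
    """Build a dict with a multi-valued sentence field and a short snippet."""
    sentences = split_sentences(body)
    return {
        "id": doc_id,
        "sentences": sentences,  # multi-valued field, one value per sentence
        "first_sentence": sentences[0] if sentences else "",  # for boosting
        "snippet": " ".join(sentences[:snippet_size]),  # for result display
    }


doc = to_document(
    "doc-1",
    "Fusion indexes documents. It also runs NLP stages! Does it scale? Yes.",
)
print(doc["sentences"])
print(doc["snippet"])
```

A separate `first_sentence` value makes the relevancy use case straightforward, because a query can boost matches in that field without inspecting the full list of sentences.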
Named Entity Recognition (NER)
Named Entity Recognition is a popular information extraction technique that identifies named entities in text and classifies them under these predefined classes:
- person
- organization
- location
For example:
Person | | Organization | | Location
---|---|---|---|---
Jane | is the CEO of | Example Company | , based in | San Francisco.
Named entity recognition is widely used in today's text mining projects. When organizations store large volumes of business documents in Fusion, the natural next step is to turn that text-centric data into some kind of knowledge base.
Take entity linking projects, for example: The client may want to link all relevant documents with an existing list of entities of interest. One way of doing this is to extract entities from all raw text documents, then perform fuzzy matching or another kind of text pattern matching to link relevant documents with a specific entity from the given list. This is more efficient than scanning the whole document and trying to search for the entity name. In this scenario, NER extraction is an ideal tool.
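The entity linking approach described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Fusion's implementation: the extracted mentions are hard-coded stand-ins for the output of an NER stage, `link_entities` and its 0.8 threshold are hypothetical, and the fuzzy matching uses `SequenceMatcher` from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_entities(mentions, known_entities, threshold=0.8):
    """Map each extracted mention to its closest known entity.

    Mentions whose best match scores below the threshold are left unlinked.
    The threshold value is an illustrative assumption, not a recommendation.
    """
    links = {}
    for mention in mentions:
        best = max(
            known_entities,
            key=lambda entity: similarity(mention, entity),
            default=None,
        )
        if best is not None and similarity(mention, best) >= threshold:
            links[mention] = best
    return links


# Existing list of entities of interest (hypothetical data).
known = ["Example Company", "Jane Doe", "San Francisco"]

# Mentions as an NER step might return them, including misspellings
# (hypothetical data).
extracted = ["Example Compnay", "San Fransisco", "Acme Widgets"]

links = link_entities(extracted, known)
print(links)
```

Because matching runs over a handful of short extracted mentions rather than the full document text, the comparison stays cheap even when documents are long.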
Fusion has integrated NER capability into its indexing and query pipelines to enable customers to perform knowledge discovery easily.
Part-of-Speech (POS) tagging
Part-of-speech tagging labels each word in a text with its grammatical category, such as noun or verb. One of its most important uses is word sense disambiguation. For instance, when searching for the word "present" with the intent of finding the concept of a gift, having "present" tagged as a noun helps filter out content that uses "present" as a verb, meaning to bring something before the public.
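As a rough illustration of this filtering idea, the Python sketch below selects documents in which a word carries a given tag. The hand-tagged tokens are hypothetical stand-ins for the output of a POS tagger, and the tag names and the `docs_with_sense` helper are assumptions made for the example, not part of Fusion.

```python
# Hand-tagged tokens standing in for POS tagger output (hypothetical data).
tagged_docs = {
    "doc-1": [("She", "PRON"), ("bought", "VERB"), ("a", "DET"), ("present", "NOUN")],
    "doc-2": [("They", "PRON"), ("present", "VERB"), ("the", "DET"), ("results", "NOUN")],
}


def docs_with_sense(docs, word, pos):
    """Return the sorted IDs of documents where word appears with tag pos."""
    return sorted(
        doc_id
        for doc_id, tokens in docs.items()
        if any(token.lower() == word and tag == pos for token, tag in tokens)
    )


# Keep only documents that use "present" as a noun.
print(docs_with_sense(tagged_docs, "present", "NOUN"))
```

The tag narrows the candidate set but does not settle the meaning on its own, since a noun such as "present" can still have more than one sense.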