Data Science Toolkit Integration

Table of Contents

DSTI components
DSTI Deprecations

In Fusion 5.1, data scientists and machine learning engineers can deploy end-user-trained Python machine learning models to Fusion using the Data Science Toolkit Integration (DSTI). This offers real-time prediction and seamless integration with query and index pipelines.

Benefits:

Extension points for data scientists to plug in customized Python modeling code
Client libraries to ease the development and testing of Python plugins
API-driven, dynamic loading and updating of plugins at runtime

Example use cases:

Using spaCy to extract named entities and index the results into a Solr collection
Using a Keras model to perform query intent classification at query time
Using pre-trained word embeddings to generate synonyms for a query
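As a rough illustration of the last use case, the sketch below ranks candidate synonyms by cosine similarity between word vectors. It is not DSTI plugin code: the function names, the similarity threshold, and the tiny three-dimensional vectors are all hypothetical, invented only to show the idea. Real pre-trained embeddings would be loaded from a model file and have hundreds of dimensions.

```python
import math

# Toy vectors invented purely for illustration (an assumption, not real
# embedding data); a real model would supply these.
TOY_EMBEDDINGS = {
    "laptop": [0.9, 0.1, 0.0],
    "notebook": [0.85, 0.15, 0.05],
    "computer": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synonyms(term, embeddings, top_k=2, min_similarity=0.9):
    """Return up to top_k vocabulary words most similar to term.

    Words scoring below min_similarity are dropped; an unknown term
    yields an empty list. Both defaults are arbitrary choices.
    """
    if term not in embeddings:
        return []
    query_vec = embeddings[term]
    scored = [
        (cosine(query_vec, vec), word)
        for word, vec in embeddings.items()
        if word != term
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for score, word in scored[:top_k] if score >= min_similarity]


print(synonyms("laptop", TOY_EMBEDDINGS))
```

At query time, the returned words could be appended to the original query as expansion terms. How that is wired into a query pipeline depends on the plugin interface and is not shown here.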

DSTI components

Jupyter Notebook service. A fully integrated Jupyter notebook in Fusion that lets data scientists explore data, test SQL aggregations, run Fusion SQL statements, and import or export data to and from other storage mechanisms using Spark in their choice of language: Scala or Python. Jupyter notebooks are still supported in Fusion 5.0.x.
Machine Learning service. Supports model serving in index pipelines.

DSTI Deprecations

In Fusion 5.3 and later, the ability to deploy Python-based models using the DSTI ml-python image has been removed. All Python-based models should be migrated to Seldon Core (see the tutorial on wrapping a Python-based model to work with Seldon Core).

SparkML and MLeap models are still supported via the DSTI, but the integration is deprecated as of Fusion 5.1 and these models should be migrated to Seldon Core as well. The DSTI support for these models will be removed in an upcoming version of Fusion.

Users who were using the spaCy model supplied with Fusion 5.0-5.2 must instead use the Create Seldon Core Model Deployment job in Fusion to deploy a Seldon Core-enabled version, which can be used as a drop-in replacement for the old model in the NLP Annotator stage. A sample configuration for deploying the replacement model looks like this:

{
   "type":"argo-deploy-model",
   "id":"seldonspacy",
   "deployModelName":"spacy-seldon",
   "modelReplicas":1,
   "modelDockerImage":"spacy-seldon:v1.0",
   "modelDockerRepo":"lucidworks",
   "columnNames":"[token_offsets, pos_labels, lemma_labels, ner_offsets, ner_labels, sentence_offsets]",
   "workflowName":"argo-deploy-model-workflow"
}
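
Before submitting a job configuration like the one above, it can help to check locally that it parses as strict JSON and contains the expected fields. The following is a minimal sketch using only the Python standard library. It makes two assumptions: that every key in the sample is required, and that columnNames is a bracketed, comma-separated string as shown. The helper names parse_column_names and check_config are hypothetical and are not part of Fusion or Seldon Core.

```python
import json

# Copy of the sample job configuration, as strict JSON.
CONFIG_JSON = """
{
   "type": "argo-deploy-model",
   "id": "seldonspacy",
   "deployModelName": "spacy-seldon",
   "modelReplicas": 1,
   "modelDockerImage": "spacy-seldon:v1.0",
   "modelDockerRepo": "lucidworks",
   "columnNames": "[token_offsets, pos_labels, lemma_labels, ner_offsets, ner_labels, sentence_offsets]",
   "workflowName": "argo-deploy-model-workflow"
}
"""

# Assumption: treat every key that appears in the sample as required.
EXPECTED_KEYS = {
    "type", "id", "deployModelName", "modelReplicas",
    "modelDockerImage", "modelDockerRepo", "columnNames", "workflowName",
}


def parse_column_names(raw):
    """Split a string like "[a, b, c]" into ["a", "b", "c"]."""
    inner = raw.strip().strip("[]")
    return [name.strip() for name in inner.split(",") if name.strip()]


def check_config(text):
    """Parse the configuration and return (config dict, column name list).

    json.loads raises json.JSONDecodeError on malformed input, for
    example a trailing comma before the closing brace.
    """
    config = json.loads(text)
    missing = EXPECTED_KEYS - set(config)
    if missing:
        raise ValueError("missing keys: " + ", ".join(sorted(missing)))
    return config, parse_column_names(config["columnNames"])


config, columns = check_config(CONFIG_JSON)
print(config["deployModelName"], columns)
```

This check only catches syntax errors and missing keys. Whether the Docker image, repository, and workflow names are valid for a given cluster can only be confirmed by running the job.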