Legacy Product

Fusion 5.10

    Trains a Smart Answers model in a supervised fashion, using either pre-trained or custom-trained embeddings, and deploys the trained model to the ML Model Service.

    id - string, required

    The ID for this job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-) and underscore (_). The ID must start with a letter.

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
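
The pattern and length limit above can be checked locally before a job is submitted. The following is a minimal sketch, assuming the pattern must match the whole ID; the helper name `is_valid_job_id` is hypothetical and not part of Fusion.

```python
import re

# Pattern and length limit copied from the schema entry above.
ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_ID_LENGTH = 63


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id fits the documented pattern and length limit.

    Assumption: the pattern is applied to the entire string (full match).
    """
    if len(job_id) > MAX_ID_LENGTH:
        return False
    return ID_PATTERN.fullmatch(job_id) is not None
```

For example, `is_valid_job_id("qna-supervised_01")` is accepted, while an ID that starts with a digit or contains a space is rejected.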

    sparkConfig - array[object]

    Provide additional key/value pairs to be injected into the training JSON map at runtime. Values are inserted as-is, so surround string values with double quotes (").

    object attributes:{
     key (required) : {
      display name: Parameter Name
      type: string
     }
     value : {
      display name: Parameter Value
      type: string
     }
    }
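
Because sparkConfig values are inserted as-is, string values need their own double quotes while numbers do not. The sketch below shows one way to build the key/value list; the helper `to_key_value_pairs` and both parameter names are hypothetical and used only for illustration.

```python
import json


def to_key_value_pairs(params: dict) -> list:
    """Convert a dict into the key/value object list used by sparkConfig.

    json.dumps renders strings with surrounding double quotes and leaves
    numbers unquoted, which matches the "inserted as-is" rule.
    """
    return [{"key": key, "value": json.dumps(value)} for key, value in params.items()]


# Hypothetical parameter names, for illustration only.
spark_config = to_key_value_pairs({"example_string_param": "abc", "example_int_param": 4})
```

Here the string parameter's value becomes `"abc"` including the quotes, and the integer parameter's value becomes `4`.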

    writeOptions - array[object]

    Options used when writing output to Solr or other sources

    object attributes:{
     key (required) : {
      display name: Parameter Name
      type: string
     }
     value : {
      display name: Parameter Value
      type: string
     }
    }

    readOptions - array[object]

    Options used when reading input from Solr or other sources.

    object attributes:{
     key (required) : {
      display name: Parameter Name
      type: string
     }
     value : {
      display name: Parameter Value
      type: string
     }
    }

    useAutoML - boolean

    Automatically tune hyperparameters (training will take longer). Transformer models aren't used in this mode.

    Default: false

    trainingCollection - string, required

    Solr collection or cloud storage path where training data is present.

    >= 1 characters

    trainingFormat - string, required

    The format of the training data, for example solr or parquet.

    >= 1 characters

    Default: solr

    trainingDataFilterQuery - string

    Solr or SQL query to filter training data. Use a Solr query when a Solr collection is specified in Training Path. Use a SQL query when a cloud storage location is specified. The table name for SQL is `spark_input`.
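
As an illustration of the two query styles, the sketch below builds a simple equality filter for either source type. The helper `build_filter_query` and the field name `lang_s` are hypothetical, the SQL form assumes a full SELECT statement against `spark_input` is expected, and no escaping of special characters is attempted.

```python
def build_filter_query(training_format: str, field: str, value: str) -> str:
    """Build an equality filter in Solr or SQL syntax, depending on the source."""
    if training_format == "solr":
        # Solr collection input: standard field:"value" query syntax.
        return f'{field}:"{value}"'
    # Cloud storage input: the data is exposed to SQL as the table spark_input.
    return f"SELECT * FROM spark_input WHERE {field} = '{value}'"


solr_filter = build_filter_query("solr", "lang_s", "en")
sql_filter = build_filter_query("parquet", "lang_s", "en")
```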

    secretName - string

    Name of the secret used to access cloud storage as defined in the K8s namespace

    >= 1 characters

    questionColName - string, required

    Name of the field containing questions

    >= 1 characters

    answerColName - string, required

    Name of the field containing answers

    >= 1 characters

    weightColName - string

    Name of the field to be used for weights

    >= 1 characters

    deployModelName - string, required

    Name of the model to be used for deployment (must be a valid lowercased DNS subdomain with no underscores)

    <= 30 characters

    Match pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$
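
The naming rule above (lowercase DNS subdomain, no underscores, at most 30 characters) can be checked with the documented pattern. This is a minimal sketch; the helper name `is_valid_deploy_model_name` is hypothetical.

```python
import re

# Pattern and length limit copied from the schema entry above.
DEPLOY_NAME_PATTERN = re.compile(
    r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$"
)
MAX_DEPLOY_NAME_LENGTH = 30


def is_valid_deploy_model_name(name: str) -> bool:
    """Return True if name is a lowercase DNS subdomain within the length limit."""
    if len(name) > MAX_DEPLOY_NAME_LENGTH:
        return False
    # fullmatch also rejects a trailing newline, which "$" alone would tolerate.
    return DEPLOY_NAME_PATTERN.fullmatch(name) is not None
```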

    modelReplicas - integer, required

    How many replicas of the model should be deployed by Seldon Core

    Default: 1

    modelBase - string, required

    Specify one of these custom embeddings: ['word_custom', 'bpe_custom'] or choose one of the included pre-trained embeddings / models.

    Default: word_en_300d_2M

    Allowed values: word_custom, bpe_custom, word_en_300d_2M, bpe_en_300d_10K, bpe_en_300d_200K, bpe_ja_300d_100K, bpe_ko_300d_100K, bpe_zh_300d_50K, bpe_multi_300d_320K, distilbert_en, distilbert_multi, biobert_v1.1

    trainingSampleFraction - number

    The proportion of data to be sampled from the full dataset. Use a value between 0 and 1 for a proportion (e.g. 0.5 for 50%), or for a specific number of examples, use an integer larger than 1. Leave blank for no sampling
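
The two interpretations of this value (a proportion up to 1, or an absolute count above 1) can be sketched as follows. The function `sample_size` is hypothetical, and the rounding and the capping at the dataset size are assumptions rather than documented behavior.

```python
from typing import Optional


def sample_size(total_rows: int, training_sample_fraction: Optional[float]) -> int:
    """Return how many rows would be sampled for a given setting."""
    if training_sample_fraction is None:
        # Left blank: no sampling, use the full dataset.
        return total_rows
    if training_sample_fraction <= 0:
        raise ValueError("trainingSampleFraction must be positive")
    if training_sample_fraction > 1:
        # Treated as an absolute number of examples (assumed capped at the total).
        return min(total_rows, int(training_sample_fraction))
    # Treated as a proportion of the dataset (assumed rounded down).
    return int(total_rows * training_sample_fraction)
```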

    minTokensNum - integer

    Drop a document if its total word count is lower than this value

    >= 1

    exclusiveMinimum: false

    Default: 1

    maxTokensNum - integer

    Drop a document if its total word count is greater than this value

    >= 1

    exclusiveMinimum: false

    Default: 5000

    lowerCases - boolean

    Whether to lowercase all words in training, i.e. whether to treat uppercase and lowercase words equally. Only utilized for custom embeddings or for the default model base: word_en_300d_2M.

    Default: false

    maxVocabSize - integer

    Maximum number of words in the vocabulary. Words are trimmed if their frequency is too low. Only utilized for custom embeddings or for the default model base: word_en_300d_2M.

    >= 1

    exclusiveMinimum: false

    w2vEpochs - integer

    Number of epochs to train custom word2vec embeddings

    Default: 15

    w2vTextsCollection - string

    Solr collection or cloud storage path which contains extra documents that will be used to get better vocabulary coverage as well as to train custom word embeddings if custom Model Base is specified.

    w2vTextColumns - string

    Which fields in the text collection to use. If multiple fields, please separate them by comma, e.g. description_t,title_t.

    textsFormat - string

    The format of the texts training data, for example solr or parquet.

    w2vVectorSize - integer

    Word-vector dimensionality to represent text (suggested dimension ranges: 100~300)

    Default: 150

    w2vWindowSize - integer

    The window size (context words from [-window, window]) for Word2Vec

    Default: 8

    valSize - number

    Proportion of the unique questions that should be used as validation samples. When valSize > 1, that specific number of unique questions is sampled rather than a proportion.

    >= 0.001

    exclusiveMinimum: false

    Default: 0.1

    maxLen - integer

    Maximum length of text processed by the model. Texts longer than this value will be trimmed. This parameter is especially important for Transformer-based models because it affects training and inference time. The maximum supported length for Transformer models is 512, so you can specify any value up to that. The default value is the larger of three times the standard deviation of question lengths and two times the standard deviation of answer lengths.
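
The default rule described above can be sketched in a few lines. The function `default_max_len` is hypothetical; the use of the population standard deviation, rounding up to an integer, and applying the 512 cap to the computed default are all assumptions, since the description does not specify them.

```python
import math
import statistics


def default_max_len(question_lengths, answer_lengths, transformer=False):
    """Approximate the documented default: max(3 * std(questions), 2 * std(answers))."""
    q_std = statistics.pstdev(question_lengths)  # assumption: population std
    a_std = statistics.pstdev(answer_lengths)
    max_len = math.ceil(max(3 * q_std, 2 * a_std))  # assumption: round up
    if transformer:
        # Transformer models support at most 512 (assumed to cap the default too).
        max_len = min(max_len, 512)
    return max_len
```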

    embSPDP - number

    Fraction of input to drop with Dropout layer (from 0-1)

    Default: 0.3

    trainBatch - integer

    Batch size during training. If left blank, this will be set automatically based on the input data

    infBatch - integer

    Batch size during validation. If left blank, this will be set automatically based on the input data

    rnnNamesList - string

    List of layers of RNNs to be used, with possible values of lstm, gru. E.g. ["lstm", "lstm"]. This value will be automatically decided based on data if left blank

    rnnUnitsList - string

    List of unit counts for the RNN layers, corresponding to the RNN function list. E.g. 150, 150. This value will be automatically decided based on data if left blank

    epochs - integer

    weightDecay - number

    L2 penalty used in AdamW optimizer. Bigger values will provide stronger regularization. Default values are 0.0003 for RNN models and 0.01 for Transformer models.

    monitorPatience - integer

    Stop training if the monitored metric does not improve for this number of epochs

    baseLR - number

    Base learning rate that should be used during training. Reasonable values are from 0.0001 to 0.003 depending on model base. It's better to use lower LR with Transformer models.

    minLR - number

    Minimum learning rate used during training. Reasonable values are from 0.00001 to 0.00003.

    numWarmUpEpochs - integer

    Number of epochs used for the warm-up stage for learning rates. Reasonable values are from 0-4 epochs, usually 1-2 are used.

    numFlatEpochs - integer

    Number of epochs used in flat stage for learning rates. Reasonable value would be one-half of the epochs, so the other half will be with Cosine Annealing learning rate.
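
The learning-rate parameters above (baseLR, minLR, numWarmUpEpochs, numFlatEpochs) describe a three-stage schedule: warm-up, flat, then cosine annealing. The sketch below is one plausible reading, not the job's actual implementation: it assumes a linear warm-up that starts at minLR, a flat stage that begins after the warm-up, and a cosine decay that ends at minLR. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch, epochs, base_lr, min_lr, num_warm_up_epochs, num_flat_epochs):
    """Return the learning rate for a 0-based epoch index."""
    if epoch < num_warm_up_epochs:
        # Warm-up stage: rise linearly from min_lr towards base_lr (assumed shape).
        return min_lr + (base_lr - min_lr) * epoch / num_warm_up_epochs
    if epoch < num_warm_up_epochs + num_flat_epochs:
        # Flat stage: hold the base learning rate.
        return base_lr
    # Cosine annealing stage: decay from base_lr down to min_lr.
    remaining = epochs - num_warm_up_epochs - num_flat_epochs
    progress = (epoch - num_warm_up_epochs - num_flat_epochs + 1) / remaining
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With 8 epochs, 2 warm-up epochs and 2 flat epochs, epochs 2 and 3 run at baseLR and the final epoch ends at minLR.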

    monitorMetric - string

    The main metric at k that should be monitored to decide when to stop training. Possible metrics are: ["map", "mrr", "recall", "precision"]

    Default: mrr@3

    monitorMetricsList - string

    List of evaluation metrics on validation data that will be printed in the log at the end of each epoch. Possible metrics are: ["map", "mrr", "recall", "precision"]

    Default: ["map", "mrr", "recall"]

    kList - string

    The k retrieval positions at which each metric is computed

    Default: [1,3,5]
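
A metric "at k", such as the default mrr@3, only looks at the top k retrieved answers for each question. The sketch below shows the standard textbook definitions of reciprocal rank and recall at k for a single query; the function names and the sample IDs are hypothetical, and the training job's own implementation may differ in detail.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant answer within the top k, or 0 if none."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant answers that appear within the top k."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def mean_over_queries(values):
    """Average a per-query metric; the mean reciprocal rank is MRR."""
    return sum(values) / len(values)


# Hypothetical ranking for one question with two correct answers.
ranked = ["a3", "a1", "a7", "a2"]
relevant = {"a1", "a2"}
```

In this example the first relevant answer is at position 2, so the reciprocal rank is 0 at k=1 and 0.5 at k=3.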

    numClusters - integer

    DEPRECATED: consider using Milvus for fast dense vector similarity search instead. Number of clusters to be used for fast dense vector retrieval. No clustering is applied if this is set to 0. If left blank, the cluster count is inferred by the job from the data

    Default: 0

    topKClusters - integer

    How many closest clusters the model can find for each query. At retrieval time, all answers in top k nearest clusters will be returned and reranked

    Default: 10

    unidecode - boolean

    Use Unidecode library to transform Unicode input into ASCII transliterations. Only utilized for custom embeddings or for the default model base: word_en_300d_2M

    Default: true

    useMixedPrecision - string

    Check this option to train a model with mixed precision support. This will only work if the node has a GPU. You'll only see a speed-up on newer NVIDIA GPUs (Turing and later) with Transformer models.

    Default: auto

    Allowed values: auto, true, false

    useLabelingResolution - boolean

    Check this to determine similar questions and similar answers via labeling resolution and graph connected components. This does not work well with noisy data such as eCommerce queries, but it helps with FAQ / QnA data.

    Default: false

    useLayerNorm - boolean

    Check this to use layer norm for pooling.

    Default: false

    globalPoolType - string

    Determines how token vectors should be aggregated to obtain final content vector. Must be one of: [avg, max, self_attention].

    Default: self_attention

    Allowed values: avg, max, self_attention

    embTrainable - boolean

    Choose this to fine-tune token embeddings during model training. Tends to work well with eCommerce data.

    Default: false

    eps - number

    Epsilon for the AdamW optimizer. By default 1e-8 is used for RNN models and 1e-6 is used for Transformer models.

    maxGradNorm - number

    Max norm used for gradient clipping. By default it is not used for RNN models, but a value of 1.0 is used for Transformer models.

    useXbm - string

    Stores encoded representations of previous batches in memory for better negative examples sampling. Works well for Transformer models. Leave this at 'auto' to let the training module determine this.

    Default: auto

    Allowed values: auto, true, false

    xbmMemorySize - integer

    Number of examples from the previous batches that are stored in memory. The default size for Transformer models is 256.

    xbmEpochActivation - integer

    After which epoch cross-batch memory should be activated. By default it’s activated after the first epoch for Transformer models.

    evalAnnIndex - string

    Choose this to use Approximate Nearest Neighbor search during evaluation. For big datasets it can speed up evaluation with minimal loss in accuracy. For small datasets it will most likely make evaluation slower.

    Default: auto

    Allowed values: auto, true, false

    distance - string

    Vectors distance/similarity that should be used during training and in the pipelines. Choose one of: ['cosine_similarity', 'dot_product_similarity', 'euclidean_distance'].

    Default: cosine_similarity

    Allowed values: cosine_similarity, dot_product_similarity, euclidean_distance
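
For reference, the three allowed measures have the following standard definitions, shown here as a plain-Python sketch on lists of floats. The function names mirror the allowed values but are otherwise illustrative; they are not the job's internal code. Note that for the two similarities a higher value means closer vectors, while for the distance a lower value does.

```python
import math


def dot_product_similarity(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their lengths."""
    norm_a = math.sqrt(dot_product_similarity(a, a))
    norm_b = math.sqrt(dot_product_similarity(b, b))
    return dot_product_similarity(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```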

    type - string, required

    Default: argo-qna-supervised

    Allowed values: argo-qna-supervised
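
Putting the required fields together, a minimal job configuration might look like the sketch below. The collection name, field names, job ID, and model name are hypothetical placeholders; the remaining values are the documented defaults. Optional parameters are omitted, and how the resulting JSON is submitted to Fusion is outside the scope of this sketch.

```python
import json

# Fields marked "required" in the parameter list above.
REQUIRED_FIELDS = [
    "id", "type", "trainingCollection", "trainingFormat", "questionColName",
    "answerColName", "deployModelName", "modelReplicas", "modelBase",
]

job_config = {
    "id": "qna-supervised-example",                  # hypothetical job ID
    "type": "argo-qna-supervised",                   # only allowed value
    "trainingCollection": "example_qna_collection",  # hypothetical collection
    "trainingFormat": "solr",                        # documented default
    "questionColName": "question_t",                 # hypothetical field name
    "answerColName": "answer_t",                     # hypothetical field name
    "deployModelName": "qna-example-model",          # hypothetical model name
    "modelReplicas": 1,                              # documented default
    "modelBase": "word_en_300d_2M",                  # documented default
}

# Sanity check that nothing required was left out, then serialize to JSON.
missing_fields = [name for name in REQUIRED_FIELDS if name not in job_config]
job_json = json.dumps(job_config, indent=2)
```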