Best Practices
Below is a collection of useful tips and tricks that can help improve results.
How to begin
Use cases for chatbots, virtual assistants, QnA, and FAQ search
In most cases, Lucidworks recommends you begin with coldstart models.
-
The coldstart models are suitable only for English, and the large coldstart model increases query time. If English-only support and a moderate query time are acceptable, use a coldstart model.
-
For multilingual support or fast query time, use a multilingual model.
Coldstart models
-
Robust out-of-the-box models.
-
Easily configured and do not require training data.
-
Used to establish a solid baseline solution to compare against when running supervised training.
Use cases for never null and eCommerce
For these use cases, the preferred method is to build a training set and use a supervised job. It is usually possible to set up a training dataset. However, if that is not possible, set up an initial coldstart pre-trained model.
The quality of the model and its results may vary because the models are designed for natural language questions, not short, loosely defined queries. To enhance the results:
-
Collect signal data (if it does not exist).
-
Build a training set.
Obtain training data
There are multiple ways to obtain training data. The supervised training job uses pairs of questions and answers (or queries and product names) to train deep learning encoders.
Use cases for chatbots, virtual assistants, QnA, and FAQ search
For these use cases:
-
Base training data typically consists of existing FAQ pair information.
-
After the coldstart solution is deployed, training data can also be derived from sources such as search logs and customer support correspondence. The data can be constructed and expanded manually.
-
Expanded training data can be obtained using the augmentation techniques in the Data Augmentation job. For natural language scenarios, backtranslation is effective to provide paraphrased versions of the original questions and answers that improve vocabulary and semantic coverage.
Use cases for never null and eCommerce
Training data for these use cases is primarily collected signals data.
There are different ways to construct training data, and the approach you choose determines how the resulting trained model behaves. The right approach depends on the:
-
Available types of signals.
-
Quantity of signals.
-
Needs of your business.
Examples include:
-
Query and product pair click signals:
-
Help increase the click-through rate (CTR).
-
Are usually high volume, but low quality.
-
Add-to-cart and purchase signals:
-
Help increase sales.
-
Have a very strong relationship between query and products.
-
Are high quality, but low volume.
-
Produce a resulting trained model that ranks products most likely to be purchased higher than it does other products (in the results generated by the query).
-
Null-search signals result from products selected after a zero-result search or an abandoned search.
-
Different training datasets can be combined to train a single model, or the results from different models can be combined at query time.
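As a rough illustration of building and combining training sets from signals, the sketch below aggregates query and product pairs by signal type. The record layout (`query`, `product`, `type`), the `min_count` filter, and the function name are assumptions made for this example only; they are not the format that the supervised training job or the signals collection actually uses.

```python
from collections import Counter

# Hypothetical signal records. Field names and values are illustrative only.
signals = [
    {"query": "red shoes", "product": "Red Running Shoe", "type": "click"},
    {"query": "Red Shoes", "product": "Red Running Shoe", "type": "click"},
    {"query": "red shoes", "product": "Blue Sandal", "type": "click"},
    {"query": "red shoes", "product": "Red Running Shoe", "type": "purchase"},
    {"query": "laptop bag", "product": "Leather Laptop Bag", "type": "cart"},
]


def build_training_pairs(records, signal_types, min_count=1):
    """Return unique (query, product) pairs seen at least min_count times."""
    counts = Counter(
        (r["query"].strip().lower(), r["product"])
        for r in records
        if r["type"] in signal_types
    )
    return sorted(pair for pair, n in counts.items() if n >= min_count)


# High-volume, low-quality click signals: require repeated evidence.
click_pairs = build_training_pairs(signals, {"click"}, min_count=2)

# Low-volume, high-quality signals: keep every pair.
purchase_pairs = build_training_pairs(signals, {"cart", "purchase"})

# One combined dataset for training a single model.
combined_pairs = sorted(set(click_pairs) | set(purchase_pairs))
```

Requiring a higher count for click signals is one possible way to offset their lower quality; the right threshold depends on your signal volume.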
How to evaluate results
The supervised training job incorporates the evaluation mechanism. A hold out validation set:
-
Is constructed automatically from unique questions and queries.
-
Can be sized in the job configuration.
-
Provides the ability to:
-
Evaluate how well models generalize to new, unseen queries.
-
Prevent the occurrence of overfitting.
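The supervised training job builds the hold out set for you. The sketch below only illustrates the idea of splitting on unique queries so that no validation query is seen during training. The function name, the `validation_fraction` parameter, and the pair format are assumptions for illustration and do not describe the job's actual implementation.

```python
import random


def split_by_unique_query(pairs, validation_fraction=0.1, seed=0):
    """Split (query, answer) pairs so that each query lands on one side only."""
    unique_queries = sorted({query for query, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(unique_queries)

    # Hold out at least one unique query for validation.
    n_val = max(1, round(len(unique_queries) * validation_fraction))
    val_queries = set(unique_queries[:n_val])

    train = [p for p in pairs if p[0] not in val_queries]
    validation = [p for p in pairs if p[0] in val_queries]
    return train, validation


# Ten unique queries, each paired with two answers.
pairs = [(f"query {i}", f"answer {i}-{j}") for i in range(10) for j in range(2)]
train, validation = split_by_unique_query(pairs, validation_fraction=0.2)
```

Because the split is made on queries rather than on individual pairs, evaluation on the validation set measures generalization to queries the model has not seen.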
You can also Evaluate a Smart Answers Query Pipeline to:
-
Ensure configuration is set correctly.
-
Compare different models or pipelines.
-
Perform an end-to-end evaluation of the entire query pipeline.
How to choose a training model
Select the model that best fits your training needs.
RNN models
RNN models support most use cases, including digital commerce and heavy query traffic domains. The models also provide:
-
Easily-understood training.
-
Extremely fast query time.
While several models are available, Lucidworks recommends beginning with word_en_300d_2M in most cases. If the use case requires support for other languages, use bpe_multilingual or one of the specific bpe models for CJK languages.
BERT models
BERT models:
-
Are only recommended if you have GPU access. Transformer-based models can only use a single GPU. Even if multiple GPUs are available, this type of model does not scale to multiple GPUs.
-
Are suitable for natural language queries, chatbots, virtual assistants, and low query traffic domains.
-
Provide better semantic quality results, especially if only a small amount of training data is available.
-
Are more expensive to use for training.
-
Run more slowly at query time than RNN models.
Model vector size
Pre-trained models produce vectors with a dimension of 512.
Vector dimensions for supervised job trained models are highlighted at the end of the training logs, or can be derived from the model and parameters:
-
BERT-based model vectors have a dimension of 768.
-
RNN-based model vectors have a dimension of two times the size of the last specified layer.
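The rules above can be written as a small helper, which may be handy when you size a vector index before training finishes. This is a minimal sketch: the function name and the `model_type` labels are hypothetical, and only the three dimension rules come from the text above.

```python
def vector_dimension(model_type, rnn_units=None):
    """Return the output vector dimension for a given model type.

    model_type: "pretrained", "bert", or "rnn" (labels chosen for this sketch).
    rnn_units: list of layer sizes, required only for "rnn".
    """
    if model_type == "pretrained":
        return 512
    if model_type == "bert":
        return 768
    if model_type == "rnn":
        if not rnn_units:
            raise ValueError("rnn_units must list at least one layer size")
        # Two times the size of the last specified layer.
        return 2 * rnn_units[-1]
    raise ValueError(f"unknown model type: {model_type}")


one_layer = vector_dimension("rnn", [128])      # 256
two_layers = vector_dimension("rnn", [128, 64])  # 128
```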
Hyperparameters to tune the training job
Most of the parameters in the supervised training job contain effective default values. Some of the parameters are automatically determined based on the dataset statistics.
Specific parameters to review are listed below, but see Advanced Model Training Configuration for Smart Answers for detailed information about tuning the training job.
General encoder parameters
These parameters are common to BERT and RNN models. The most significant parameter in this section is Fine-tune Token Embeddings. The parameter is disabled by default to prevent overfitting. However, enable the parameter to improve results for queries with multiple variations or jargon, such as queries in the eCommerce domain.
RNN encoder parameters
- Dropout Ratio
-
Tune in the [0.1, 0.5] range to improve regularization and prevent overfitting. The customary value is 0.15 or 0.3.
- RNN Function (Units) List
-
This parameter determines the number of layers to use and the output vector size.
One layer is typically sufficient, and Lucidworks does not recommend using over two layers. Adding a smaller second layer produces a smaller vector size that reduces the vector index size and might reduce query time.
For example, suppose the current configuration is a one-layer network with 128 units (256 vector size) that provides good results. To improve QPS and index a very large collection without losing quality, add a small layer with 64 units (128 vector size). Instead of adding one small layer, you can add two small layers. The two additional small layers slightly increase encoding time, but the resulting vector takes half as much memory, which may improve search time.
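To see why a smaller last layer matters for a very large collection, the sketch below estimates raw vector storage for the two configurations in the example. It assumes 32-bit floats (4 bytes per value) and ignores any index overhead, so treat the numbers as a lower bound rather than a measurement. The function name and the collection size of 10 million are hypothetical.

```python
def raw_vector_gigabytes(num_vectors, rnn_units, bytes_per_value=4):
    """Estimate raw storage for RNN-encoded vectors, excluding index overhead."""
    # RNN vector size is two times the size of the last specified layer.
    dimension = 2 * rnn_units[-1]
    return num_vectors * dimension * bytes_per_value / 1e9


num_vectors = 10_000_000  # hypothetical collection size

one_layer_gb = raw_vector_gigabytes(num_vectors, [128])       # 256-dim vectors
two_layer_gb = raw_vector_gigabytes(num_vectors, [128, 64])   # 128-dim vectors
```

Under these assumptions, the 128-dimension vectors need half the raw storage of the 256-dimension vectors.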
Training and evaluation parameters
To reduce training time, especially when training on a CPU, lower the values of the following parameters:
-
Number of Epochs
-
Monitor Patience
To increase training time, increase the values of these parameters.
The Training Batch Size parameter affects how long an epoch takes and how often validation is performed. Lucidworks recommends setting the following:
-
At least 100 batches per training epoch
-
Use bigger batches with larger training datasets
-
Values of:
-
[32, 64] for small training sets
-
[128, 256] for medium training sets
-
[512, 1024, 2048] for large training sets
-
Due to the size of BERT-based models, the recommended training batch size values are [8, 16, 32].
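One way to apply these recommendations together is to pick the largest listed batch size that still leaves at least 100 batches per epoch. The sketch below does that. The function name and the selection rule are assumptions for illustration, since the text does not define where small, medium, and large training sets begin; only the candidate values and the 100-batch guideline come from the recommendations above.

```python
import math

RNN_BATCH_SIZES = [32, 64, 128, 256, 512, 1024, 2048]
BERT_BATCH_SIZES = [8, 16, 32]


def suggest_batch_size(num_examples, model_type="rnn", min_batches=100):
    """Pick the largest listed batch size that keeps >= min_batches per epoch."""
    candidates = BERT_BATCH_SIZES if model_type == "bert" else RNN_BATCH_SIZES
    valid = [
        size for size in candidates
        if math.ceil(num_examples / size) >= min_batches
    ]
    # Fall back to the smallest listed size when the dataset is too small
    # to reach min_batches with any candidate.
    return max(valid) if valid else min(candidates)


small = suggest_batch_size(5_000)                 # 32
medium = suggest_batch_size(20_000)               # 128
large = suggest_batch_size(1_000_000)             # 2048
bert = suggest_batch_size(1_000_000, "bert")      # 32
```

Treat the result as a starting point and adjust it based on the validation metrics reported in the training logs.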
How to choose a Milvus index
Milvus supports multiple indexes. See the following Milvus documentation:
-
Vector index for detailed descriptions of the indexes.
-
Performance FAQ for performance information.
Default index
The default (FLAT) index provides:
-
The best possible quality (typically 100% of the model quality), but the slowest query time.
-
A reasonable query time for low traffic use cases under 250 QPS.
HNSW index
The HNSW index:
-
Supports higher QPS.
-
Supports collections over a few million documents.
-
Needs to be configured with the parameter values detailed in Create Indexes in Milvus Jobs.
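The guidance above can be summarized as a simple rule of thumb. The sketch below is a heuristic only: the function name is hypothetical, and the default `large_collection_threshold` of 2,000,000 is an assumed stand-in for "a few million" that you should replace with a value based on your own benchmarks. It does not call Milvus.

```python
def suggest_milvus_index(expected_qps, num_documents,
                         large_collection_threshold=2_000_000):
    """Suggest FLAT or HNSW from expected traffic and collection size."""
    # FLAT gives the best quality and is reasonable under 250 QPS.
    # HNSW supports higher QPS and larger collections.
    if expected_qps >= 250 or num_documents > large_collection_threshold:
        return "HNSW"
    return "FLAT"


low_traffic = suggest_milvus_index(expected_qps=50, num_documents=200_000)
high_traffic = suggest_milvus_index(expected_qps=800, num_documents=200_000)
big_collection = suggest_milvus_index(expected_qps=50, num_documents=5_000_000)
```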