Document Clustering Jobs

Table of Contents

Cluster labels and frequent terms
Configuration tips
Configuration properties

The Document Clustering job uses an unsupervised machine learning algorithm to group documents into clusters based on similarities in their content. You can enable more efficient document exploration by using these clusters as facets, high-level summaries or themes, or to recommend other documents from the same cluster. The job can automatically group similar documents in all kinds of content, such as clinical trials, legal documents, book reviews, blogs, scientific papers, and products.

Input

Searchable content (your primary collection)

Output

The job adds the following fields to the content documents:

cluster_id

The IDs associated with each cluster so that we can easily identify the clusters by number. A negative number means the document is an outlier or extremely long.

cluster_label, cluster_label_txt

Unique keywords assigned to each cluster so that there are no overlapping keywords between the clusters.

The cluster label long_doc is applied to very lengthy documents.
The cluster label short_doc is applied to very short documents.
Outliers are grouped with cluster labels like outlier_group0, outlier_group1, and so on.

dist_to_center

The document’s distance from its corresponding cluster center. The shorter the distance, the closer the document is to the center of the cluster.

freq_terms, freq_terms_txt

The most frequent words in the cluster.

clustering_model_id

The ID of the Document Clustering job that attached these fields.

The Document Clustering job is an end-to-end job that includes the following:

Document preprocessing
Separating out extremely lengthy documents and outliers (de-noise)
Automatic selection of the number of clusters
Extracting cluster keyword labels

You can choose between multiple clustering and featurization methods to find the best combination of methods.

Cluster labels and frequent terms

Cluster labels are the terms that best represent the documents in a given cluster. They are typically words that can be found in the documents closest to the cluster centroid, so the dist_to_center field can be used to sort documents by similarity to the cluster_label field.

Frequent terms are the terms that appear most frequently in documents in a given cluster. Different clusters may have overlapping frequent terms. Some of the frequent terms may also appear in the cluster label.
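As a rough illustration of how dist_to_center can be used, the sketch below picks the documents closest to each cluster center. The sample documents and the most_representative helper are hypothetical; only the field names cluster_id and dist_to_center come from the job output described above.

```python
from collections import defaultdict

# Hypothetical documents as they might look after the job has run.
docs = [
    {"id": "a", "cluster_id": 0, "dist_to_center": 0.42},
    {"id": "b", "cluster_id": 0, "dist_to_center": 0.10},
    {"id": "c", "cluster_id": 1, "dist_to_center": 0.30},
    {"id": "d", "cluster_id": 1, "dist_to_center": 0.05},
    {"id": "e", "cluster_id": 0, "dist_to_center": 0.77},
]


def most_representative(docs, top_n=2):
    """Return the IDs of the top_n documents closest to each cluster center."""
    by_cluster = defaultdict(list)
    for doc in docs:
        by_cluster[doc["cluster_id"]].append(doc)
    result = {}
    for cluster_id, members in by_cluster.items():
        # A shorter distance means the document is closer to the center.
        ranked = sorted(members, key=lambda d: d["dist_to_center"])
        result[cluster_id] = [d["id"] for d in ranked[:top_n]]
    return result


representative = most_representative(docs)
print(representative)
```

Documents at the top of each list are the ones most likely to contain the cluster label terms.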

Configuration tips

The minimum required fields to configure are straightforward:

Training Collection
Output Collection
Field to Vectorize

When you first create a new Document Clustering job, set the Training Collection to point to a special collection that contains a sample set of documents. Set the Output Collection to a new, empty collection. That way, you can test the job quickly over a smaller input collection and you can clear the output collection after each test.

When you create your sample collection and your test output collection, uncheck Enable Signals to prevent secondary collections from being created.

When you are satisfied with the results of the job, set both the Training Collection and the Output Collection to your primary collection. It may take some time for the job to run over your entire primary collection. At the end, the new clustering fields are added to your existing searchable content.
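The two-stage setup described above might look like the following sketch. The property names are taken from the Configuration properties section; the job IDs, collection names, and field name are hypothetical placeholders, and the way the configuration is submitted to Fusion is not shown.

```python
import json


def clustering_job_config(job_id, training_collection, output_collection, field):
    """Build a minimal job configuration containing only the required properties."""
    return {
        "type": "doc_clustering",
        "id": job_id,
        "trainingCollection": training_collection,
        "outputCollection": output_collection,
        "fieldToVectorize": field,
        "dataFormat": "solr",  # documented default
        "uidField": "id",      # documented default
    }


# Stage 1: test on a small sample collection, writing to an empty test collection.
test_config = clustering_job_config(
    "doc-clustering-test", "products_sample", "products_clusters_test", "description_t"
)

# Stage 2: once satisfied, read from and write to the primary collection.
final_config = clustering_job_config(
    "doc-clustering", "products", "products", "description_t"
)

print(json.dumps(test_config, indent=2))
```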

The sections below discuss additional ways to tune your configuration for the best results.

Use stopwords

The quality of your cluster labels depends on the quality of your stopwords. Fusion comes with a very basic stopword list, but there may be other words in your corpus that you want to remove. If you see terms in your cluster labels that you do not want, add them to your stopwords and then re-run the job.

Scale your Spark resources

Make sure that you have enough resources in Spark to run the job, especially if you have many documents or your documents are long. If Spark doesn’t have enough memory, you may see out-of-memory errors in the Job History tab.

If there are obstacles to scalability, you can index a subset of the text from each document, then run the job on this downsized version of the documents. The algorithm does not need all of the text to generate meaningful clusters. Stopword removal also decreases the document size significantly.
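One simple way to downsize documents before indexing is to keep only the first tokens of each text. The sketch below assumes whitespace tokenization and a hypothetical limit of 200 tokens; neither is prescribed by the job, so choose a limit that suits your corpus.

```python
def truncate_text(text, max_tokens=200):
    """Keep only the first max_tokens whitespace-separated tokens of a text."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


sample = "the quick brown fox jumps over the lazy dog"
print(truncate_text(sample, max_tokens=4))
```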

Select the clustering method

The job provides these clustering methods:

Hierarchical Bisecting Kmeans (“hierarchical”)

The default choice is “hierarchical,” which is a mixed method between Kmeans and hierarchical clustering. It can tackle the problem of uneven cluster sizes produced by standard Kmeans, and is more robust regarding initialization. In addition, it runs much faster than the standard hierarchical-clustering method, and has fewer problems dealing with overlapping topic documents.

Standard Kmeans (“kmeans”)

For use cases such as novel and review clustering, several words can express similar meanings. In that case, Kmeans can perform well in combination with the Word2Vec featurization method described below. This method is also helpful when you have a corpus with a large vocabulary. Kmeans also works well when clusters are "convex", meaning that they are regularly shaped.

There are two ways to configure the number of clusters:

Setting Number Of Clusters speeds up the processing time, but finding the best single value can be difficult unless you know exactly how many clusters are in the dataset.
Setting Minimum Possible Number Of Clusters and Maximum Possible Number Of Clusters allows Fusion to test up to 20 different values within the configured range to find the best number of clusters for your dataset, based on metrics such as how far each data point is from its associated cluster center. This optimizes the algorithm to detect true groups of similar documents and thus create better-quality clusters.

For example, if kMin=2 and kMax=100, then the job searches through 2, 7, 12, …, 100 with a step size of 5. A large kMax can increase the running time. The algorithm incurs a penalty if k is unnecessarily large. You can use the kDiscount parameter to reduce this penalty so that a larger k is chosen (a smaller kDiscount value favors a larger k). However, if kMax is small (for example, 10 or less), then not using a discount (kDiscount=1) is recommended.
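The step size in the example above can be reproduced with a small sketch. It assumes the step is the size of the range divided by the maximum of 20 trials, rounded up; this is a guess at the behavior, not the job's documented formula. Note that this sketch stops at 97 for the range 2 to 100, whereas the example above lists 100 as the last value; how the job treats the upper bound is not specified here.

```python
import math


def candidate_ks(k_min, k_max, max_trials=20):
    """List candidate cluster counts between k_min and k_max (inclusive).

    Assumption: the step is chosen so that at most max_trials values are tested.
    """
    span = k_max - k_min + 1
    step = max(1, math.ceil(span / max_trials))
    return list(range(k_min, k_max + 1, step))


ks = candidate_ks(2, 100)
print(ks[:3], "...", ks[-1], f"({len(ks)} values)")
```

With the default range (kMin=2, kMax=20), the same function returns every value from 2 to 20, because the range already contains fewer than 20 candidates.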

Select the featurization method

The job provides two text-vectorization methods:

TFIDF

You can trim out noisy terms for TFIDF by specifying the Min Doc Support and Max Doc Support parameters (minimum and maximum number of documents that contain the term).

If you are using the hierarchical clustering method, then you should apply the TFIDF featurization method; it can provide more detailed clusters for use cases like clustering email or product descriptions.

Word2Vec

Word2Vec can reduce dimensions and extract contextual information by putting co-occurring words in the same subspace. However, it can also lose some detailed information by abstraction. If you assign Word2Vec Dimension an integer greater than 0, then Fusion chooses the Word2Vec method over TFIDF.

If you are using the standard kmeans clustering method, then you should enable the Word2Vec featurization method.

For a large corpus dataset with a big vocabulary, Word2Vec is preferred to help deal with the dimensionality.
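The rules above can be summarized in a short sketch. The threshold convention (a value below 1.0 is a fraction of the corpus, 1.0 means all documents, a value above 1.0 is an exact count) and the defaults minDF=5 and maxDF=0.5 come from the Configuration properties section. The function names, the rounding, and the inclusive bounds are assumptions made for illustration.

```python
def resolve_doc_support(value, total_docs):
    """Convert a Min/Max Doc Support setting into an absolute document count."""
    if value <= 1.0:
        # Fractional setting: a share of the corpus (1.0 means every document).
        return int(value * total_docs)
    # Values above 1.0 are exact document counts.
    return int(value)


def keep_term(doc_freq, total_docs, min_df=5, max_df=0.5):
    """Decide whether a term survives TFIDF trimming (bounds assumed inclusive)."""
    low = resolve_doc_support(min_df, total_docs)
    high = resolve_doc_support(max_df, total_docs)
    return low <= doc_freq <= high


def featurization_method(w2v_dimension):
    """Word2Vec is chosen over TFIDF when Word2Vec Dimension is greater than 0."""
    return "word2vec" if w2v_dimension > 0 else "tfidf"


print(keep_term(doc_freq=100, total_docs=1000))
print(featurization_method(0), featurization_method(100))
```

In a corpus of 1,000 documents with the default settings, a term found in 3 documents is trimmed as too rare, and a term found in 600 documents is trimmed as too common.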

Configure de-noise parameters

The job provides three layers of protection from the impact of noisy documents:

In the analyzerConfig parameter, you can specify stopword deletion, stemming, short token treatment, and regular expressions. The analyzerConfig is used in the featurization step.
You can add an optional phase to separate out documents that are extremely long or short (as measured by the number of tokens). Extremely short or long documents can contaminate the clustering process. Documents with a length between Length Threshold for Short Doc and Length Threshold for Long Doc are kept for clustering.
The job performs outlier detection using the Kmeans method. Fusion groups documents into Number of Outlier Groups, then trims out clusters with a size less than Outlier Cutoff as outliers.

Outliers are documents that are too distant from the nearest centroid or nearest neighbors. These distant outliers are grouped into outlier clusters and labeled as outlier_group0, outlier_group1 … outlier_groupN, where N is the value of Number of Outlier Groups.
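The size-based cutoff described above can be sketched as follows. This is an illustration only: it assumes the preliminary Kmeans grouping has already assigned one group index to each document (the assignments mapping is hypothetical), and it assumes outlier groups are numbered in order of their group index. The preliminary grouping itself is not reproduced.

```python
from collections import Counter


def label_outlier_groups(assignments, outlier_threshold=0.01):
    """Map document IDs in undersized groups to outlier_group labels.

    assignments: dict of document ID -> preliminary group index.
    outlier_threshold: fraction of all documents if <= 1.0, else an exact count.
    """
    total = len(assignments)
    if outlier_threshold <= 1.0:
        cutoff = outlier_threshold * total
    else:
        cutoff = outlier_threshold
    sizes = Counter(assignments.values())
    # Groups smaller than the cutoff are treated as outlier groups.
    small_groups = sorted(g for g, size in sizes.items() if size < cutoff)
    names = {g: f"outlier_group{i}" for i, g in enumerate(small_groups)}
    return {doc: names[g] for doc, g in assignments.items() if g in names}


# 200 hypothetical documents: 150 in group 0, 49 in group 1, 1 in group 2.
assignments = {}
for i in range(200):
    assignments[f"doc{i}"] = 0 if i < 150 else (1 if i < 199 else 2)

outliers = label_outlier_groups(assignments)
print(outliers)
```

With the default threshold of 0.01 and 200 documents, any group with fewer than 2 documents is trimmed, so only the single document in group 2 is labeled as an outlier.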

Configure cluster labelling

The configuration parameter Number Of Keywords For Each Cluster determines the number of keywords to pick to describe each cluster. Test different values until you find that the cluster labels give an accurate depiction of the data contained within a given cluster.

Evaluating and tuning the results

When the job has finished:

Navigate to the output collection.
Open the Query Workbench.
Click Add a Field Facet and select cluster_label.

The cluster_label field contains the terms that best define the center of each cluster (five by default, as set by Number Of Keywords For Each Cluster).
Click Add a Field Facet again and select freq_terms.

The freq_terms field contains the five terms that appear most often in each cluster. In many cases, the frequent terms in a cluster are also among those that best define it.

The dist_to_center field can be used to sort documents by similarity to their cluster labels.
Explore the facets to determine whether the clusters are useful.
- Be sure to examine the long_doc cluster and any outlier_group<n> clusters. If no outliers are detected, consider increasing outlierK or outlierThreshold.
- If you see terms that you do not want in your cluster labels, add them to your stopwords.
- To tune the granularity of your clusters, adjust the values of Min Possible Number of Clusters and Max Possible Number of Clusters. Increasing these values will break up some of the bigger groups into smaller ones. Decreasing the values can consolidate smaller groups into larger ones. Experiment until you find the level of granularity that produces the most meaningful clusters.
- If the clusters are very uneven, such as when most documents are in one large cluster:
  
  Try increasing the outlier cutoff (some outlier groups are being labeled as clusters).
  
  Try increasing k (you have many clusters combined into one).
- If you have many outlier groups, but only a few docs per outlier group, try increasing the outlier cutoff to avoid saturating your number of outlier groups before all outliers have been removed.
- If the corpus is large and the clustering job is taking too long:
  
  Use Kmeans and Word2Vec (Kmeans does not usually do well with TFIDF).
  
  Use an exact k value (Number Of Clusters) instead of a range (Min Possible Number of Clusters and Max Possible Number of Clusters).
- If the clustering job completes successfully according to logs, but is not writing data to Solr, check the schema specs on output fields. Solr strings have a maximum length that Spark strings do not. Use text type if your outputs are very long.
  
  This problem appears only when reading from and writing to different collections.
Re-run the job and examine the facets again to see whether the results are more useful.
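If you prefer to inspect the results outside the Query Workbench, the same facets can be requested with standard Solr query parameters. The sketch below only builds the query string; the endpoint it would be sent to depends on your deployment and is not shown. The field names are the job's default output fields.

```python
from urllib.parse import urlencode

# Standard Solr request parameters; facet.field is repeated once per facet.
params = [
    ("q", "*:*"),
    ("rows", "10"),
    ("facet", "true"),
    ("facet.field", "cluster_label"),
    ("facet.field", "freq_terms"),
    # Closest documents to their cluster center come first.
    ("sort", "dist_to_center asc"),
]

query_string = urlencode(params)
print(query_string)
```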

Back up your primary collection

Before running this job over your primary collection, make sure you have a backup of the original content. This can come in handy if you change your mind about the results and want to overwrite the document clustering fields.

Configuration properties

Use this job when you want to cluster a set of documents and attach cluster labels based on topics.

id - string, required

The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.

<= 63 characters

Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?

sparkConfig - array[object]

Spark configuration settings.

object attributes:
key (required): display name "Parameter Name", type string
value: display name "Parameter Value", type string

trainingCollection - string, required

Solr Collection containing documents to be clustered

>= 1 characters

fieldToVectorize - string, required

Solr field containing text training data. Data from multiple fields with different weights can be combined by specifying them as field1:weight1,field2:weight2 etc.

>= 1 characters

dataFormat - string, required

Spark-compatible format that contains training data (like 'solr', 'parquet', 'orc' etc)

>= 1 characters

Default: solr

trainingDataFrameConfigOptions - object

Additional spark dataframe loading configuration options

trainingDataFilterQuery - string

Solr query to use when loading training data if using Solr

Default: *:*

sparkSQL - string

Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as spark_input

Default: SELECT * from spark_input

trainingDataSamplingFraction - number

Fraction of the training data to use

<= 1

exclusiveMaximum: false

Default: 1

randomSeed - integer

For any deterministic pseudorandom number generation

Default: 1234

outputCollection - string, required

Solr Collection to store model-labeled data to

>= 1 characters

dataOutputFormat - string

Spark-compatible output format (like 'solr', 'parquet', etc)

>= 1 characters

Default: solr

sourceFields - string

Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

partitionCols - string

If writing to non-Solr sources, this field will accept a comma-delimited list of column names for partitioning the dataframe before writing to the external output

writeOptions - array[object]

Options used when writing output to Solr or other sources

object attributes:
key (required): display name "Parameter Name", type string
value: display name "Parameter Value", type string

readOptions - array[object]

Options used when reading input from Solr or other sources.

object attributes:
key (required): display name "Parameter Name", type string
value: display name "Parameter Value", type string

uidField - string, required

Field containing the unique ID for each document.

>= 1 characters

Default: id

clusterIdField - string

Output field name for unique cluster id.

Default: cluster_id

clusterLabelField - string

Output field name for top frequent terms that are (mostly) unique for each cluster.

Default: cluster_label

freqTermField - string

Output field name for top frequent terms in each cluster. These may overlap with other clusters.

Default: freq_terms

distToCenterField - string

Output field name for doc distance to its corresponding cluster center (measure how representative the doc is).

Default: dist_to_center

minDF - number

Minimum number of documents in which a term must appear. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 5

maxDF - number

Maximum number of documents in which a term can appear. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 0.5

kExact - integer

Exact number of clusters.

Default: 0

kMax - integer

Max possible number of clusters.

Default: 20

kMin - integer

Min possible number of clusters.

Default: 2

docLenTrim - boolean

Whether to separate out docs with extreme lengths.

Default: true

outlierTrim - boolean

Whether to perform outlier detection.

Default: true

shortLen - number

Length threshold to define short document. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 5

longLen - number

Length threshold to define long document. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 0.99

numKeywordsPerLabel - integer

Number of Keywords needed for labeling each cluster.

Default: 5

modelId - string

Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.

>= 1 characters

w2vDimension - integer

Word-vector dimensionality to represent text (choose > 0 to use, suggested dimension ranges: 100~150)

exclusiveMinimum: false

Default: 0

w2vWindowSize - integer

The window size (context words from [-window, window]) for word2vec

>= 3

exclusiveMinimum: false

Default: 8

norm - integer

p-norm to normalize vectors with (choose -1 to turn normalization off)

Default: 2

Allowed values: -1, 0, 1, 2

analyzerConfig - string

LuceneTextAnalyzer schema for tokenization (JSON-encoded)

>= 1 characters

Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "patternreplace", "pattern": "^[\\d.]+$", "replacement": " ", "replace": "all" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}

clusteringMethod - string

Choose between hierarchical vs kmeans clustering.

Default: hierarchical

outlierK - integer

Number of clusters to help find outliers.

Default: 10

outlierThreshold - number

Identify as outlier group if less than this percent of total documents. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 0.01

minDivisibleSize - number

Clusters must have at least this many documents to be split further. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 0

kDiscount - number

Applies a discount to help favor large/small K (number of clusters). A smaller value pushes K to assume a higher value within the [min, max] K range.

Default: 0.7

type - string, required

Default: doc_clustering

Allowed values: doc_clustering
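Two of the string properties above follow formats that are easy to check before saving a job. The sketch below validates an id against the documented match pattern and length limit, and parses a fieldToVectorize value of the form field1:weight1,field2:weight2. The helper names are hypothetical, and treating a field without an explicit weight as weight 1.0 is an assumption.

```python
import re

# Match pattern and length limit as documented for the id property.
ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_ID_LENGTH = 63


def is_valid_job_id(job_id):
    """Check a job ID against the documented pattern and maximum length."""
    return len(job_id) <= MAX_ID_LENGTH and ID_PATTERN.fullmatch(job_id) is not None


def parse_field_weights(spec, default_weight=1.0):
    """Parse 'field1:weight1,field2:weight2' into a dict of field -> weight."""
    weights = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, weight = part.partition(":")
        # Assumption: a field listed without a weight gets default_weight.
        weights[name] = float(weight) if sep else default_weight
    return weights


print(is_valid_job_id("doc-clustering"))
print(parse_field_weights("title_t:2,body_t:1"))
```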