Training a model

Model training teaches an existing deep learning model how to process documents that are representative of those the model will encounter in your production environment.

If required for your solution, model training occurs after or in conjunction with annotation, and before building a flow. Model training is typically handled by a solution developer. See the Solution Builder guide for core tasks related to model training, including creating a model and starting a training run.

Marketplace models vs. base models

Instabase offers two types of deep learning models:

  • Marketplace models have been pre-trained on common document types. If the documents you’re processing match an existing Marketplace model, you can often skip model training or fast-track it by performing limited additional training. Retraining a Marketplace model is supported for classification and extraction models only, not table extraction models.

  • Base models are deep learning models that have been trained on large collections of diverse documents or texts. Base models typically require more training than Marketplace models, because they haven't been trained on the specific document types you're processing.

Both Marketplace models and base models are provided through the Marketplace. Base models are updated regularly to take advantage of the latest advancements in deep learning. For instructions on updating your base models, see Update Marketplace.

You can also publish and share your own trained models through the Marketplace. The Marketplace is local to your Instabase instance; your published models aren’t shared with other customers.

Importing and evaluating Marketplace models

Before using a Marketplace model, you must import it into an existing model project. After import, you can optionally evaluate the model to see how it performs on your annotation set. Evaluation performs a test training run without changing the underlying deep learning model.

Tip

You can import and evaluate multiple Marketplace models to compare how they perform with your annotation set.

  1. With your model open in ML Studio, in the Imported Models section, click Import.

  2. Click Import Model for any model you want to import.

    The model is imported from the Marketplace into your model project.

  3. (Optional) To evaluate the model, in the imported model card, click Evaluate. Adjust any evaluation hyperparameters and click Evaluate.

Note that you can also evaluate a trained model from your project that has not been published. See Selecting a model.

Training requirements for annotation sets

Model training requires a minimum of five training records and two test records, so your annotation set must include at least seven records that are marked as annotated. In sets with manual test record selection enabled, you must select at least two of your annotated records as test records.

You must meet the minimum annotation requirement for each type of model you’re training. For example, if your solution requires a classification model, extraction model, and table extraction model, you must annotate at least seven records with classification, seven records with text extraction, and seven records with table extraction.

Marking records as annotated helps Instabase determine when training and testing requirements are met, but during training, all records with annotations are used, regardless of whether they’re marked as annotated.
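
As a quick illustration, the sketch below checks whether an annotation set meets the minimum counts described above. The function and counts are illustrative only; this isn't part of any Instabase API.

    # Illustrative sketch: check whether an annotation set meets the minimum
    # counts described above (5 training records + 2 test records = 7 annotated
    # records per model type). The function name is hypothetical.
    MIN_TRAINING_RECORDS = 5
    MIN_TEST_RECORDS = 2

    def meets_minimum(annotated_count: int, test_count: int) -> bool:
        training_count = annotated_count - test_count
        return (training_count >= MIN_TRAINING_RECORDS
                and test_count >= MIN_TEST_RECORDS)

    print(meets_minimum(annotated_count=7, test_count=2))  # True
    print(meets_minimum(annotated_count=7, test_count=1))  # False: only 1 test record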

Although you can train a model with a small annotation set, model performance increases with more annotations to train with. In fact, providing more annotations is generally the best way to improve model performance, especially with unstructured or irregular documents.

Model training options

Use this reference for configuration options as you create and train models.

Model types

When you create an ML Studio project in Instabase, you must specify a model type that corresponds with the data processing task you’re tackling.

  • Classification models predict the class—or type—of record, splitting multipage documents if needed.

  • Extraction models extract text.

  • Table extraction models extract tables, or fields within tables.

Often, document processing solutions require some combination of these model types. For example, if you want to classify documents as well as extract text and tables, you must create three models to perform each of those tasks. You can use the same annotation set for multiple models, and you can use multiple models in a single flow or solution.

Selecting a model

By default, a base model that matches your model type is selected when you begin model training. To choose a different model to iterate on, click the edit icon in the Select a model section. Available models are displayed on separate tabs.

  • Base models includes any generalized deep learning models that suit your model type.

    Note

    In classification models, if you want to split multipage documents before classifying, select one of the layout_split_classifier* base models.

  • Imported models includes any models you imported from the Marketplace, whether published by Instabase or another user in your instance.

  • Trained models includes any previously trained models available in your Solution Builder project or ML Studio project. Retraining a model generates a new trained model with unique metadata, so the existing trained model is preserved.

If an imported Marketplace model or a previously trained model matches your document processing requirements, select it over a base model, because it will likely require less training than a base model.

You can perform training runs on multiple models to see how they perform, and then choose the best fit.

Test / train split

For models that use an annotation set with automatic test record selection, you can specify the percent of records to use for testing. The default of 30 percent is a good rule of thumb for test records. To specify a different test percentage for multiple connected annotation sets, select Edit per annotation set.

The Randomization Seed determines which records the model uses for testing. By default, the value is set to 0, which maintains the same set of test records on multiple, identical training runs. To randomize test record selection across multiple training runs, specify a different seed for each run.

If your annotation set specifies test records manually, you can't customize the testing percentage: the model uses whichever records you marked for testing.
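
The sketch below illustrates the general idea of a seeded test / train split: the same seed always selects the same test records, and a different seed selects a different subset. It's a conceptual illustration only, not Instabase's implementation.

    import random

    # Conceptual illustration of a seeded test/train split, not Instabase's
    # implementation. A fixed seed makes the selection of test records
    # repeatable across identical runs.
    def split_records(record_ids, test_percent=30, seed=0):
        rng = random.Random(seed)
        shuffled = list(record_ids)
        rng.shuffle(shuffled)
        test_count = max(1, round(len(shuffled) * test_percent / 100))
        return shuffled[test_count:], shuffled[:test_count]  # (train, test)

    records = [f"record_{i}" for i in range(10)]
    _, test_a = split_records(records, seed=0)
    _, test_b = split_records(records, seed=0)
    assert test_a == test_b          # same seed, same test records
    _, test_c = split_records(records, seed=42)
    print(test_a, test_c, sep="\n")  # a different seed usually picks different records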

Hyperparameters

Hyperparameters control the configuration of model evaluation and training runs. Available hyperparameters vary depending on the model architecture you select.

Use this reference to understand the most common hyperparameters, or refer to the ibformers documentation for more details.

Create validation set

Enable this option to hold out a random subset of documents from your annotation set as a validation set during training runs. A validation set is required to enable certain features, including early stopping, hyperparameter optimization, and calibration.

Validation sets are used in conjunction with several advanced hyperparameters:

  • validation_set_size specifies the percent of an annotation set to use as a validation set. The default value of 10 percent is suitable in most cases. Increase this value if you want more dependable validation results.

  • do_hyperparam_optimization tests various hyperparameter combinations to see how they affect model performance. Optimal settings are output to logs and used to train the final model artifact.

  • early_stopping_patience specifies the number of additional epochs to run after model performance peaks, and uses the epoch with the best performance in the final model artifact. The default value of 3 is suitable in most cases. For smaller annotation sets, you might want to increase the value to 5–7 to account for increased randomness.
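
To show how early_stopping_patience interacts with per-epoch validation scores, here is a generic early-stopping sketch. It is not the ibformers implementation; the training and evaluation callables are placeholders.

    # Generic early-stopping sketch, not the ibformers implementation.
    # train_one_epoch and evaluate_on_validation_set are placeholders for
    # the real training loop and validation metric (for example, F1).
    def train_with_early_stopping(train_one_epoch, evaluate_on_validation_set,
                                  max_epochs=20, patience=3):
        best_score, best_epoch = float("-inf"), None
        epochs_without_improvement = 0
        for epoch in range(1, max_epochs + 1):
            train_one_epoch()
            score = evaluate_on_validation_set()
            if score > best_score:
                best_score, best_epoch = score, epoch
                epochs_without_improvement = 0  # improvement resets the patience counter
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break                       # stop after `patience` epochs with no gain
        # The final artifact keeps the weights from best_epoch, not the last epoch.
        return best_epoch, best_score

    # Demo with simulated validation scores instead of real training.
    scores = iter([0.62, 0.70, 0.74, 0.73, 0.74, 0.72])
    print(train_with_early_stopping(lambda: None, lambda: next(scores),
                                    max_epochs=6, patience=3))  # (3, 0.74)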

Number of epochs

Use this option to specify the number of iterations, or epochs, completed in a training run. Increasing the number of epochs might improve model performance, especially with smaller annotation sets. However, too many epochs can cause overfitting.

Tip

You can look for overfitting by reviewing model performance after each epoch to see if performance deteriorates at a certain point.

Run annotation analysis job

Enable this option to have the model generate suggestions and error probabilities for previously annotated fields. You can then use this information in the corresponding annotation set to verify the accuracy of existing annotations.

Annotation analysis jobs generate specialized output, so model versions with this option enabled can’t be used in flows or solutions.

Tip

Annotation suggestions—without error probability—are generated by default for unannotated classes or fields after training or evaluation runs.

For details about using annotation analysis, see Using annotation assist.

Enable list support

Select this option to treat multiple annotated values associated with a list field as separate items. If your annotation set includes list annotations, this option is selected by default.

If list support is disabled, multiple annotations associated with a list field are concatenated into a single value.
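
As a purely conceptual illustration of the difference, the values of a list field either stay separate or collapse into one value. The field and values below are invented, and the exact output format of Instabase results differs from this sketch.

    # Conceptual illustration only; the exact output format of Instabase
    # results differs. Three annotated values for a hypothetical list field:
    annotated_values = ["2021-01-05", "2021-02-05", "2021-03-05"]

    with_list_support = annotated_values               # three separate items
    without_list_support = " ".join(annotated_values)  # one concatenated value

    print(with_list_support)     # ['2021-01-05', '2021-02-05', '2021-03-05']
    print(without_list_support)  # 2021-01-05 2021-02-05 2021-03-05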

Assessing model performance

After an evaluation or training run completes, click View details to see metrics, logs, and other information that can help you assess model performance.

Metrics

See how your model performed by viewing Metrics. Model performance is measured in several different ways, depending on model type. For details about metrics, refer to the ibformers documentation.

These key terms are helpful in understanding metrics:

  • Recall measures how well the model performed at retrieving all relevant information, such as records of a certain class or data points for a specific field.

  • Precision measures how well the model performed at retrieving only relevant information.

  • F1 calculates the harmonic mean of recall and precision. This is often the metric of choice if you need a single measure of model performance.

  • ECE (expected calibration error) measures how closely the model's confidence in its predictions matches its actual accuracy; lower values indicate better-calibrated predictions.
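
To make these definitions concrete, the worked example below computes precision, recall, and F1 for a single extraction field; the counts are invented for illustration.

    # Worked example for one field: the model returned 8 predictions, 6 of
    # which were correct, and the annotation set contains 10 true values.
    true_positives = 6    # correct predictions
    false_positives = 2   # incorrect or extraneous predictions
    false_negatives = 4   # true values the model missed

    precision = true_positives / (true_positives + false_positives)  # 6/8  = 0.75
    recall = true_positives / (true_positives + false_negatives)     # 6/10 = 0.60
    f1 = 2 * precision * recall / (precision + recall)               # ~0.667

    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")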

Metrics are calculated at multiple levels of granularity, so you can focus on the performance indicators that matter most to you or pinpoint problem areas. For example, you can use metrics for a specific class or field to determine whether to annotate more documents with that class or field. Similarly, you can compare how the model performed at the class or field level to its performance on more discrete measures for those data points: tokens, which make up fields, and chunks, which make up records. Significant discrepancies in granular metrics might indicate deficiencies with record digitization.

For a JSON download of metrics and related data, including test / train record counts, confusion matrix, and train-validation curve data, click Download near the top right. For details about what's stored in this JSON, see the Model API metrics response schema.

Configuration

View Configuration to review the model and hyperparameters used for training, see details about the annotation set used, and publish or prune the model.

Test records

See what kind of mistakes your model made by viewing Test Records. The model's real-world performance on test records can identify problem areas that aren't apparent from metrics and suggest corrective actions. For example, if the model frequently misses the second line of a name on driver's licenses, consider adding more driver's licenses with two-line names to your annotation set.

Logs

See details about how your model performed throughout a training run by viewing Logs. For example, you can review how the model performed after each epoch, and then limit future training runs to the point at which the model stops improving. Logs are also helpful to troubleshoot failed training runs.

Pruning models

Pruning a trained model can improve its speed and efficiency, reduce its memory consumption, and might even improve its accuracy by making the model better at generalizing. Model pruning uses an existing trained model to train a smaller, faster model.

Note

Pruning is supported only for instalm_base and instalmv3 extraction models. Other models don’t display the prune option.

  1. With your trained model open in ML Studio, on the Train & Evaluate tab, click View details on the model you want to prune.

  2. Click the Configuration tab, then click Prune.

  3. Adjust Model compression after pruning to remove a fraction of the model weights.

    Tip

    Model compression values of 0.7 or 0.8 were effective in testing. Higher values, such as 0.9 or 0.95, create smaller models, but are more likely to degrade accuracy.
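
Pruning itself is handled entirely within ML Studio. If you're unfamiliar with the technique, the sketch below shows what unstructured magnitude pruning does in general, using PyTorch on a toy layer. It is not Instabase's pruning implementation, and interpreting a compression value of 0.7 as removing 70 percent of weights is an assumption made only for this illustration.

    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Conceptual illustration of magnitude pruning using PyTorch on a toy
    # layer; this is not Instabase's pruning implementation. Here a
    # compression value of 0.7 is interpreted as zeroing out the 70 percent
    # of weights with the smallest magnitudes (an assumption).
    layer = nn.Linear(256, 256)
    prune.l1_unstructured(layer, name="weight", amount=0.7)
    prune.remove(layer, "weight")  # make the pruning permanent

    sparsity = float((layer.weight == 0).sum()) / layer.weight.nelement()
    print(f"fraction of weights set to zero: {sparsity:.2f}")  # ~0.70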

Publishing models

Publish a model to make it available for use throughout your environment.

  1. With your trained model open in ML Studio, on the Train & Evaluate tab, click View details on the model you want to publish.

  2. Click the Configuration tab, then click Publish.

  3. Choose whether to publish your model as a classic model or a Marketplace model, depending on how you want to use it, then click Publish.

    Publishing your model as a Marketplace model makes your model searchable and usable by others in your Instabase instance. If you’re publishing a Marketplace model, you’re prompted to provide additional details about the model, such as description, demo documents, and metadata tags.

Note

To unpublish a model, from the model configuration in ML Studio, click Unpublish. Unpublishing breaks any flows that use the model.

Importing legacy models

Models trained with earlier Instabase versions can’t be published or imported into version 23.01, because the package metadata was changed to support new features.

To use a legacy model with Instabase 23.01 or later, you can either retrain the model using Instabase 23.01 or later and ibformers v2 or later, or manually edit the model artifact to make it compatible with release 23.01.

Follow these steps to manually edit the model artifact.

  1. Open the model’s package.json file. The method you use to access this file differs depending on whether you have access to the ML Studio model.

    If you have access to the ML Studio model:

    1. In the Instabase Explorer, locate the training job folder.

    2. In the training job folder, open the artifact/ folder, and select the package.json file. The file opens in the Text Editor app.

    If you don't have access to the ML Studio model:

    1. In the Instabase Explorer, locate the model's .ibsolution file.

    2. Rename the .ibsolution file by adding a .zip file extension to the end, and unzip the file.

    3. Select the package.json file located in the top-level directory of the unzipped file. The file opens in the Text Editor app.

  2. Edit the package.json file to add a "model_type" key. (To script this step instead of editing the file manually, see the sketch after step 4.)

    For extraction models, the value is ["extraction"]. For classification models, the value is ["classification"]. For example:

    {
        "name": "your_model_name",
        ...
        "model_type": ["extraction"]
    }
    
  3. Save your changes.

  4. Republish the model from ML Studio, or repackage the folder into a solution and republish it.
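
If you'd rather script step 2 than use the Text Editor app, the sketch below adds the "model_type" key with Python's json module. The file path is a placeholder; point it at the package.json you located in step 1.

    import json

    # Add the "model_type" key to a legacy model's package.json (step 2).
    # The path is a placeholder; use the file located in step 1.
    package_path = "path/to/artifact/package.json"

    with open(package_path) as f:
        package = json.load(f)

    package["model_type"] = ["extraction"]  # use ["classification"] for classification models

    with open(package_path, "w") as f:
        json.dump(package, f, indent=4)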