Fine-Tuning Task
The fine-tuning task fine-tunes a LoRA model on top of a pretrained Stable Diffusion model. It has two components:
A generalized schema to define a fine-tuning task.
An execution engine that runs the task defined in the above schema.
The task definition is represented as key-value pairs that can be transformed into, among other formats, a JSON string, which can be validated using a JSON schema. Validation tools for JSON schemas exist for most popular programming languages.
The execution engine is integrated into the nodes of the Hydrogen Network, and the JSON string form of the task definition is used to send tasks across the Hydrogen Network.
Fine-Tuning Task Definition
The following is an intuitive look at a task definition:
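The sketch below is illustrative only: the field names come from the sections that follow, the nesting of `lr_scheduler` and `adam_args` inside `train_args` is an assumption about the schema layout, and every value shown is an example rather than a required default.

```json
{
  "model": {
    "name": "runwayml/stable-diffusion-v1-5",
    "variant": null,
    "revision": "main"
  },
  "dataset": {
    "name": "lambdalabs/naruto-blip-captions",
    "config_name": null,
    "image_column": "image",
    "caption_column": "text"
  },
  "validation": {
    "prompt": null,
    "num_images": 4
  },
  "train_args": {
    "learning_rate": 1e-4,
    "batch_size": 1,
    "gradient_accumulation_steps": 4,
    "prediction_type": null,
    "max_grad_norm": 1.0,
    "num_train_steps": 100,
    "max_train_steps": 1000,
    "scale_lr": true,
    "resolution": 512,
    "noise_offset": 0,
    "snr_gamma": null,
    "lr_scheduler": {
      "lr_scheduler": "cosine",
      "lr_warmup_steps": 500
    },
    "adam_args": {
      "beta1": 0.9,
      "beta2": 0.999,
      "weight_decay": 0.01,
      "epsilon": 1e-8
    }
  },
  "lora": {
    "rank": 8,
    "init_lora_weights": true,
    "target_modules": ["to_k", "to_q", "to_v", "to_out.0"]
  },
  "transforms": {
    "center_crop": false,
    "random_flip": true
  },
  "dataloader_num_workers": 0,
  "mixed_precision": "fp16",
  "seed": 42,
  "checkpoint": null
}
```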
A full example of the fine-tuning task can be found in the GitHub repository.
model
Model defines the pretrained base model for fine-tuning.
The argument `name` defines the name of the pretrained base model. The model name could be one of the original Stable Diffusion models, such as Stable Diffusion 1.5 or Stable Diffusion XL, or a checkpoint fine-tuned from one of the original Stable Diffusion models. The model name should be a Hugging Face model ID.
The argument `variant` defines the model dtype variant, which can be null (no variant), fp16 (float16), or bf16 (bfloat16). Default is null.
The argument `revision` defines the model revision, which can be main or a commit hash of the model repo. Default is main.
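For example, a `model` section selecting Stable Diffusion XL with the float16 weights variant (the model ID is the public Hugging Face one; the combination of values is illustrative):

```json
"model": {
  "name": "stabilityai/stable-diffusion-xl-base-1.0",
  "variant": "fp16",
  "revision": "main"
}
```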
dataset
Dataset defines the dataset to train on.
The argument `name` defines the name of the dataset to train on. It can be a Hugging Face dataset ID, or a path pointing to a local copy of a dataset in your filesystem.
The argument `config_name` defines the config name of the dataset; leave it as null if there is only one config. Default is null.
The argument `image_column` defines the column of the dataset containing an image. Default is "image".
The argument `caption_column` defines the column of the dataset containing a caption. Default is "text".
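For example, a `dataset` section pointing at a local copy of a dataset with custom column names (the path and column names are hypothetical):

```json
"dataset": {
  "name": "/data/my-finetune-dataset",
  "config_name": null,
  "image_column": "img",
  "caption_column": "caption"
}
```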
validation
Validation defines the prompt for the validation inference.
The argument `prompt` defines the prompt for the validation inference. It should be a string or null. When `prompt` is null, `num_images` prompts will be randomly selected from the dataset for inference.
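For example, a `validation` section with a fixed prompt (the prompt text and values are illustrative):

```json
"validation": {
  "prompt": "a portrait photo of a person, best quality",
  "num_images": 4
}
```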
train_args
Train args defines the arguments for training.
The argument `learning_rate` defines the initial learning rate (after the potential warmup period) to use.
The argument `batch_size` defines the batch size of the training dataloader.
The argument `gradient_accumulation_steps` defines the number of update steps to accumulate before performing a backward/update pass.
The argument `prediction_type` defines the prediction type that shall be used for training. Choose between 'epsilon' and 'v_prediction', or leave it as null. If left as null, the default prediction type of the scheduler, `noise_scheduler.config.prediction_type`, is chosen.
The argument `max_grad_norm` defines the max gradient norm for gradient clipping in training.
Usually, the whole training process takes a long time to complete. We cannot run it in a single task, because each task has a max execution time limit in the Hydrogen Network, and the limit is too short to complete the whole training process. So we split the training process into several tasks: each task runs only a few steps of the training and saves its result, and the next task uses the previous task's result as its base model to continue the training. This repeats until the whole training is completed. We use the arguments `num_train_epochs` or `num_train_steps` to define the epochs or update steps performed in one task, and the arguments `max_train_epochs` or `max_train_steps` to define the epochs or update steps the whole training process takes. If `num_train_steps` and `max_train_steps` are provided, they override `num_train_epochs` and `max_train_epochs`, respectively.
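For example (the numbers are illustrative), to run a 1000-step fine-tuning as a chain of tasks of 100 steps each, every task in the chain would carry:

```json
"num_train_steps": 100,
"max_train_steps": 1000
```

The first task in the chain starts from the base model, and each subsequent task sets the `checkpoint` argument (described below) to the checkpoint directory produced by the previous task.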
The argument `scale_lr` defines whether to scale the learning rate by the number of GPUs, the gradient accumulation steps, and the batch size. Default is true.
The argument `resolution` defines the resolution of the input images. All the images in the train/validation dataset will be resized to this resolution.
The argument `noise_offset` defines the scale of the noise offset.
The argument `snr_gamma` defines the SNR weighting gamma to be used when rebalancing the loss. The recommended value is 5.0. Default is null, which means the loss is not rebalanced.
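For example, to rebalance the loss with the recommended gamma, a partial `train_args` fragment might look like this (only the fields shown are included; the `noise_offset` value is an arbitrary illustration):

```json
"resolution": 512,
"noise_offset": 0.05,
"snr_gamma": 5.0
```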
lr_scheduler
The argument `lr_scheduler` defines the learning rate scheduler type to use. Choose between ["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"].
The argument `lr_warmup_steps` defines the number of steps for the warmup in the lr scheduler.
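For example, a cosine schedule with a 500-step warmup (illustrative values):

```json
"lr_scheduler": {
  "lr_scheduler": "cosine",
  "lr_warmup_steps": 500
}
```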
adam_args
The argument `adam_args` defines the parameters for the Adam optimizer.
The argument `beta1` defines the beta1 parameter for the Adam optimizer.
The argument `beta2` defines the beta2 parameter for the Adam optimizer.
The argument `weight_decay` defines the weight decay to use.
The argument `epsilon` defines the epsilon value for the Adam optimizer.
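A typical `adam_args` section using the conventional Adam defaults (these values are common practice, not confirmed schema defaults):

```json
"adam_args": {
  "beta1": 0.9,
  "beta2": 0.999,
  "weight_decay": 0.01,
  "epsilon": 1e-8
}
```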
lora
Lora defines the arguments for the LoRA layers.
The argument `rank` defines the dimension of the LoRA attention.
The argument `init_lora_weights` defines how to initialize the weights of the adapter layers. It can be a boolean or one of ["gaussian", "loftq"]. Passing true (the default) results in the default initialization from the reference implementation from Microsoft. Passing "gaussian" results in Gaussian initialization scaled by the LoRA rank. Setting the initialization to false leads to completely random initialization and is discouraged. Pass "loftq" to use LoftQ initialization.
The argument `target_modules` defines the names of the modules to apply the adapter to. If this is specified, only the modules with the specified names will be replaced. When passing a string, a regex match will be performed.
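For example, a `lora` section targeting the attention projections of the Stable Diffusion UNet (the module names below are the ones used by the diffusers implementation; treat the exact list and values as illustrative):

```json
"lora": {
  "rank": 16,
  "init_lora_weights": "gaussian",
  "target_modules": ["to_k", "to_q", "to_v", "to_out.0"]
}
```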
transforms
Transforms defines the transform operations applied to the images before training.
The argument `center_crop` defines whether to center-crop the input images to the resolution. If not set, the images will be randomly cropped. The images will be resized to the resolution first before cropping.
The argument `random_flip` defines whether to randomly flip the images horizontally.
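For example (illustrative values):

```json
"transforms": {
  "center_crop": true,
  "random_flip": true
}
```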
dataloader_num_workers
The argument `dataloader_num_workers` defines the number of subprocesses to use for data loading. 0 means that the data will be loaded in the main process.
mixed_precision
The argument `mixed_precision` defines whether to use mixed precision in training. Choose between fp16 (float16) and bf16 (bfloat16); bf16 requires PyTorch >= 1.10 and an Nvidia Ampere GPU. Setting it to no disables mixed precision.
seed
The argument `seed` defines the seed used to initialize the random processes.
The Hydrogen Network requires a deterministic training algorithm, which means the results generated on different nodes with the same devices, given the same task definition, should be identical. This is a requirement for the consensus protocol to work. The seed is left as a required argument in the task definition so that all the nodes can use the same seed to initialize their random number generators, which will hopefully produce the same random numbers across all the nodes.
Besides the seed, the task execution framework has been implemented to maximize reproducibility.
checkpoint
The argument `checkpoint` defines whether this task should be resumed from a previous checkpoint. It should be a directory containing the checkpoint files in your local filesystem. If the task is executed in the Hydrogen Network, this parameter will be injected automatically if a checkpoint is provided.
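For example, to resume from the checkpoint directory produced by a previous task (the path is hypothetical):

```json
"checkpoint": "/path/to/previous/output/checkpoint"
```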
Fine-Tuning Task Response
The fine-tuning task response consists of two directories, `checkpoint` and `validation`, which store the checkpoint files and the validation result images, respectively.
The checkpoint files can be used as the final LoRA weights, or as the checkpoint from which the next task resumes. The validation result images can be used to check the model quality.
When executing the fine-tuning task, you need to pass an argument `output_dir` to specify where the `checkpoint` and `validation` directories will be stored. If the task is executed in the Hydrogen Network, the `output_dir` parameter will be injected automatically.