hat.engine

Engine of the main training loop in HAT.

Engine

LoopBase

LoopBase controls the data flow from data_loader to model, including model forward, loss backward and parameter update.

Trainer

Trainer is a tool for training that includes the whole training pipeline.

Predictor

Predictor is a tool for prediction.

Calibrator

Calibrator is a tool for calibration.

distributed_data_parallel_trainer

distributed_data_parallel_trainer is a helper function that creates a new Trainer instance which trains with the DistributedDataParallel method and runs on one of the GPU devices.

data_parallel_trainer

data_parallel_trainer is a helper function that creates a new Trainer instance which trains with the DataParallel method and runs on multiple GPU devices.

Processor

BasicBatchProcessor(need_grad_update, …)

Processor that deals with an (inputs, target) batch, where the model output is a (losses, preds) pair.

MultiBatchProcessor(need_grad_update, …)

Processor that can forward and backward multiple batches within a training step (before optimizer.step()).

API Reference

class hat.engine.Calibrator(model: torch.nn.modules.module.Module, data_loader: Iterable, batch_processor: hat.engine.processors.processor.BatchProcessorMixin, device: Optional[int] = None, num_steps: Optional[int] = None, callbacks: Optional[Sequence[Union[dict, hat.callbacks.callbacks.CallbackMixin]]] = None, val_metrics: Optional[dict] = None, profiler: Optional[dict] = None, log_interval: int = 0)

Calibrator is a tool for calibration.

The abundant callbacks of the trainer are also supported.

Parameters
  • model – nn.Module instance.

  • data_loader – Validation data loader.

  • batch_processor – Batch processor config.

  • device – Int gpu id or None.

  • num_steps – Num of calibration steps, should be non-negative integer.

  • callbacks – Callbacks.

  • val_metrics – Metrics on validation data.

  • profiler – To profile individual steps during training and assist in identifying bottlenecks.

  • log_interval – Logging output frequency.
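
A minimal construction sketch follows. It assumes calib_model, calib_loader and batch_processor are pre-built objects (names are illustrative, not part of the API) and that Calibrator exposes the same fit() entry point documented for LoopBase below.

# Sketch only: run calibration for a fixed number of steps on GPU 0.
from hat.engine import Calibrator

calibrator = Calibrator(
    model=calib_model,                # nn.Module instance (assumed pre-built)
    data_loader=calib_loader,         # iterable of calibration batches
    batch_processor=batch_processor,  # BatchProcessorMixin config/instance
    device=0,                         # int GPU id, or None for CPU
    num_steps=100,                    # stop after 100 calibration steps
    log_interval=10,
)
calibrator.fit()                      # assumed entry point, as on LoopBase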

class hat.engine.LoopBase(model: torch.nn.modules.module.Module, data_loader: Iterable, optimizer: torch.optim.optimizer.Optimizer, batch_processor: hat.engine.processors.processor.BatchProcessorMixin, device: Optional[int], stop_by: Optional[str] = 'epoch', num_epochs: Optional[int] = None, start_epoch: Optional[int] = 0, num_steps: Optional[int] = None, start_step: Optional[int] = 0, callbacks: Optional[Sequence[Union[dict, hat.callbacks.callbacks.CallbackMixin]]] = None, train_metrics: Optional[dict] = None, val_metrics: Optional[dict] = None, profiler: Optional[dict] = None, log_interval: int = 0)

LoopBase controls the data flow from data_loader to model, including model forward, loss backward and parameter update.

It is hardware independent and runs on CPU (device is None) or GPU (device is an int GPU id).

By setting stop_by, you are able to stop the loop by counting epochs (default) or steps.

Parameters
  • model – Model config or a nn.Module instance.

  • data_loader – Training data loader config or an instantiated data loader.

  • optimizer – Optimizer config or an optimizer instance.

  • batch_processor – Batch processor config or a BatchProcessorMixin instance.

  • device – Int gpu id or None. If int, do model.cuda(device) and data.cuda(device). If None, no-op.

  • stop_by – Stop loop by counting epoch or step. If equal to ‘epoch’, stop loop when epoch_id == num_epochs - 1. If equal to ‘step’, stop loop when global_step_id == num_steps - 1. Default ‘epoch’.

  • num_epochs – Num of loop epochs, should be non-negative integer. If stop_by != ‘epoch’, no-op. Set 0 to skip loop epochs and run self.on_*_loop_begin/end only.

  • start_epoch – Training start epoch, should be non-negative integer.

  • num_steps – Num of loop steps, should be non-negative integer. If stop_by != ‘step’, no-op. Set 0 to skip loop steps and run self.on_*_loop_begin/end only.

  • start_step – Training start step, should be non-negative integer.

  • callbacks – Callback configs or instances.

  • train_metrics – Metrics on training data.

  • val_metrics – Metrics on validation data.

  • profiler – To profile individual steps during loop and assist in identifying bottlenecks.

  • log_interval – Logging output frequency.

fit()

Do model fitting on data from data_loader.

self.batch_processor is responsible for model forward, loss backward and parameter update.

self.callbacks are responsible for metric updates, checkpointing, logging and so on.
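
As a rough illustration of the stop_by semantics above, the two modes bound the loop as sketched here; steps_per_epoch is an assumed data loader length and the arithmetic is illustrative only, not part of the API.

# stop_by='epoch': the loop ends after epoch_id == num_epochs - 1.
# stop_by='step':  the loop ends when global_step_id == num_steps - 1.
steps_per_epoch = 1000                                   # assumed len(data_loader)

num_epochs = 12                                          # epoch mode
total_steps_epoch_mode = num_epochs * steps_per_epoch    # 12000 optimizer steps

num_steps = 12000                                        # step mode, same budget
total_steps_step_mode = num_steps                        # 12000 optimizer steps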

class hat.engine.Predictor(model: torch.nn.modules.module.Module, data_loader: Iterable, batch_processor: hat.engine.processors.processor.BatchProcessorMixin, device: Optional[int] = None, num_epochs: int = 1, callbacks: Optional[Sequence[Union[dict, hat.callbacks.callbacks.CallbackMixin]]] = None, metrics: Optional[dict] = None, profiler: Optional[dict] = None, log_interval: int = 0)

Predictor is a tool for prediction.

The abundant callbacks of the trainer are also supported.

Parameters
  • model – nn.Module instance.

  • data_loader – Validation data loader.

  • batch_processor – Batch processor config.

  • callbacks – Callbacks.

  • num_epochs – Num epochs.

  • metrics – Metrics on predict data.

  • profiler – To profile individual steps during predicting and assist in identifying bottlenecks.

  • log_interval – Logging output frequency.
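
A minimal construction sketch follows; the objects other than Predictor itself are assumed to be pre-built, the batch processor is assumed to have been created with need_grad_update=False, and Predictor is assumed to share the fit() entry point of LoopBase.

# Sketch only: one evaluation pass over the prediction data.
from hat.engine import Predictor

predictor = Predictor(
    model=model,                      # nn.Module instance (assumed pre-built)
    data_loader=val_loader,           # prediction/validation data loader
    batch_processor=batch_processor,  # built with need_grad_update=False
    device=0,                         # int GPU id, or None for CPU
    num_epochs=1,                     # a single pass over the data
    metrics=metrics,                  # metrics on predict data
    log_interval=20,
)
predictor.fit()                       # assumed entry point, as on LoopBase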

class hat.engine.Trainer(model: torch.nn.modules.module.Module, data_loader: Iterable, optimizer: torch.optim.optimizer.Optimizer, batch_processor, device: Optional[int], stop_by: Optional[str] = 'epoch', num_epochs: Optional[int] = None, start_epoch: Optional[int] = 0, num_steps: Optional[int] = None, start_step: Optional[int] = 0, callbacks: Optional[Sequence[Union[dict, hat.callbacks.callbacks.CallbackMixin]]] = None, train_metrics: Optional[dict] = None, val_metrics: Optional[dict] = None, profiler: Optional[dict] = None, log_interval: int = 0)

Trainer is a tool for training that includes the whole training pipeline.

Parameters
  • model – Model config or a nn.Module instance.

  • data_loader – Training data loader config or an instantiated data loader.

  • optimizer – Optimizer config or an optimizer instance.

  • batch_processor – Batch processor config or a BatchProcessorMixin instance.

  • device – Int gpu id or None. If int, do model.cuda(device) and data.cuda(device). If None, no-op.

  • stop_by – Stop training by counting epoch or step. If equal to ‘epoch’, stop training when epoch_id == num_epochs - 1. If equal to ‘step’, stop training when global_step_id == num_steps - 1. Default ‘epoch’.

  • num_epochs – Num of training epochs, should be non-negative integer. If stop_by != ‘epoch’, no-op. Set 0 to skip training and run self.on_loop_begin/end only.

  • start_epoch – Training start epoch, should be non-negative integer.

  • num_steps – Num of training steps, should be non-negative integer. If stop_by != ‘step’, no-op. Set 0 to skip training and run self.on_loop_begin/end only.

  • start_step – Training start step, should be non-negative integer.

  • callbacks – Callback configs or instances.

  • train_metrics – Metrics on training data.

  • val_metrics – Metrics on validation data.

  • profiler – To profile individual steps during training and assist in identifying bottlenecks.

  • log_interval – Logging output frequency.
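
A minimal construction sketch follows. It assumes model, train_loader, optimizer, batch_processor and callbacks are pre-built (config dicts are also accepted where documented) and that Trainer inherits fit() from LoopBase; the numbers are illustrative.

# Sketch only: single-GPU training stopped by step count.
from hat.engine import Trainer

trainer = Trainer(
    model=model,
    data_loader=train_loader,
    optimizer=optimizer,
    batch_processor=batch_processor,
    device=0,                 # int GPU id; None would run on CPU
    stop_by="step",
    num_steps=10000,          # stop when global_step_id == 9999
    callbacks=callbacks,      # e.g. checkpoint / logging callbacks
    log_interval=50,
)
trainer.fit()                 # fit() as documented on LoopBase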

hat.engine.data_parallel_trainer(model: torch.nn.modules.module.Module, data_loader: Iterable, optimizer: torch.optim.optimizer.Optimizer, batch_processor: hat.engine.processors.processor.BatchProcessorMixin, device: Union[int, Sequence[int]], stop_by: Optional[str] = 'epoch', num_epochs: Optional[int] = None, start_epoch: Optional[int] = 0, num_steps: Optional[int] = None, start_step: Optional[int] = 0, callbacks: Optional[Sequence[Union[dict, hat.callbacks.callbacks.CallbackMixin]]] = None, train_metrics: Optional[dict] = None, val_metrics: Optional[dict] = None, profiler: Optional[dict] = None) → hat.engine.trainer.Trainer

data_parallel_trainer is a helper function that creates a new Trainer instance which trains with the DataParallel method and runs on multiple GPU devices.

It can be launched by the launch function below.

By setting stop_by, you are able to stop training by counting epochs (default) or steps.

Parameters
  • model – Model config or a nn.Module instance.

  • data_loader – Training data loader config or an instantiated data loader.

  • optimizer – Optimizer config or an optimizer instance.

  • batch_processor – Batch processor config or a BatchProcessorMixin instance.

  • device – GPU ids.

  • stop_by – Stop training by counting epoch or step. If equal to ‘epoch’, stop training when epoch_id == num_epochs - 1. If equal to ‘step’, stop training when global_step_id == num_steps - 1. Default ‘epoch’.

  • num_epochs – Num of training epochs, should be non-negative integer. If stop_by != ‘epoch’, no-op. Set 0 to skip training and run self.on_loop_begin/end only.

  • start_epoch – Training start epoch, should be non-negative integer.

  • num_steps – Num of training steps, should be non-negative integer. If stop_by != ‘step’, no-op. Set 0 to skip training and run self.on_loop_begin/end only.

  • start_step – Training start step, should be non-negative integer.

  • callbacks – Callback configs or instances.

  • train_metrics – Metrics on training data.

  • val_metrics – Metrics on validation data.

  • profiler – To profile individual steps during training and assist in identifying bottlenecks.

Returns

Trainer instance.
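
A usage sketch under the same assumptions as for Trainer above; device takes the list of GPU ids that DataParallel should use.

# Sketch only: DataParallel training across two GPUs on one machine.
from hat.engine import data_parallel_trainer

trainer = data_parallel_trainer(
    model=model,
    data_loader=train_loader,
    optimizer=optimizer,
    batch_processor=batch_processor,
    device=[0, 1],            # GPU ids used by DataParallel
    stop_by="epoch",
    num_epochs=12,
)
trainer.fit()                 # the returned object is a regular Trainer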

hat.engine.distributed_data_parallel_trainer(model: torch.nn.modules.module.Module, data_loader: Iterable, optimizer: torch.optim.optimizer.Optimizer, batch_processor: hat.engine.processors.processor.BatchProcessorMixin, device: int, stop_by: Optional[str] = 'epoch', num_epochs: Optional[int] = None, start_epoch: Optional[int] = 0, num_steps: Optional[int] = None, start_step: Optional[int] = 0, callbacks: Optional[Sequence[hat.callbacks.callbacks.CallbackMixin]] = None, sync_bn: Optional[bool] = False, sync_bn_by_host: Optional[bool] = False, train_metrics: Optional[dict] = None, val_metrics: Optional[dict] = None, profiler: Optional[dict] = None, task_sampler: object = None) → hat.engine.trainer.Trainer

distributed_data_parallel_trainer is a helper function that creates a new Trainer instance which trains with the DistributedDataParallel method and runs on one of the GPU devices.

It can be launched by the launch function below, which spawns multiple processes, each of which owns an independent Trainer.

By setting stop_by, you are able to stop training by counting epochs (default) or steps.

Parameters
  • model – Model config or a nn.Module instance.

  • data_loader – Training data loader config or an instantiated data loader.

  • optimizer – Optimizer config or an optimizer instance.

  • batch_processor – Batch processor config or a BatchProcessorMixin instance.

  • device – GPU id.

  • stop_by – Stop training by counting epoch or step. If equal to ‘epoch’, stop training when epoch_id == num_epochs - 1. If equal to ‘step’, stop training when global_step_id == num_steps - 1. Default ‘epoch’.

  • num_epochs – Num of training epochs, should be non-negative integer. If stop_by != ‘epoch’, no-op. Set 0 to skip training and run self.on_loop_begin/end only.

  • start_epoch – Training start epoch, should be non-negative integer.

  • num_steps – Num of training steps, should be non-negative integer. If stop_by != ‘step’, no-op. Set 0 to skip training and run self.on_loop_begin/end only.

  • start_step – Training start step, should be non-negative integer.

  • callbacks – Callback configs or instances.

  • sync_bn – Whether to convert bn to sync_bn.

  • sync_bn_by_host – Whether to sync BN within the host node.

  • train_metrics – Metrics on training data.

  • val_metrics – Metrics on validation data.

  • profiler – To profile individual steps during training and assist in identifying bottlenecks.

  • task_sampler – TaskSampler config for multitask training.

Returns

Trainer instance.
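
A per-process usage sketch; the surrounding launch/process-spawning code is omitted, local_rank stands for the GPU id owned by the current process, and the other objects are assumed to be built inside that process.

# Sketch only: one DDP worker building its own Trainer on one GPU.
from hat.engine import distributed_data_parallel_trainer

trainer = distributed_data_parallel_trainer(
    model=model,
    data_loader=train_loader,     # typically using a distributed sampler
    optimizer=optimizer,
    batch_processor=batch_processor,
    device=local_rank,            # the single GPU id of this process
    stop_by="step",
    num_steps=20000,
    sync_bn=True,                 # convert BN to sync BN across processes
)
trainer.fit()                     # the returned object is a regular Trainer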

class hat.engine.processors.BasicBatchProcessor(need_grad_update: bool, batch_transforms: Optional[List] = None, loss_collector: Optional[Callable] = None, enable_amp: bool = False, enable_apex: bool = False)

Processor that deals with an (inputs, target) batch, where the model output is a (losses, preds) pair.

It is suitable for training (need_grad_update) or validation (not need_grad_update).

Parameters
  • need_grad_update – Whether gradient update is needed: True for training, False for validation.

  • batch_transforms – Config of batch transforms.

  • loss_collector – A callable object used to collect loss Tensors in model outputs.

  • enable_amp – Whether to train with Automatic Mixed Precision.

  • enable_apex – Whether to train with Apex.
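
A sketch of the input/output contract this processor expects; ToyModel and the exact way the processor unpacks the batch and calls the model are illustrative assumptions, not part of the documented API.

# Sketch only: a model whose output matches the (losses, preds) contract.
import torch
import torch.nn as nn
from hat.engine.processors import BasicBatchProcessor, collect_loss_by_index

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, inputs, target):
        # target: LongTensor of class indices (assumed batch layout)
        preds = self.fc(inputs)
        losses = [nn.functional.cross_entropy(preds, target)]
        return losses, preds                       # (losses, preds) pair

processor = BasicBatchProcessor(
    need_grad_update=True,                         # training mode
    loss_collector=collect_loss_by_index(0),       # pick the losses entry
)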

class hat.engine.processors.BatchProcessorMixin

Batch Processor Interface.

class hat.engine.processors.MultiBatchProcessor(need_grad_update: bool, batch_transforms: Optional[List] = None, loss_collector: Optional[Callable] = None, enable_amp: bool = False, enable_apex: bool = False, delay_sync: bool = False)

Processor that can forward and backward multiple batches within a training step (before optimizer.step()).

It is useful for:

(1) Training a multitask model on single-task annotation samples, where each task forwards and backwards its own batch sequentially within a multitask training step.

(2) Training on a GPU with limited memory while wanting a larger effective batch size, by forwarding and backwarding multiple batches within a training step.

Note

Example multitask: vehicle, person and traffic light detection. Single-task annotation means that only vehicle bounding boxes are annotated on an image containing vehicle, person, and traffic light objects.

Note

Multiple batches should be organized in tuple format, e.g.

  • batch = (batch1, batch2, …)

If not, it will be treated as a single batch, e.g.

  • batch = dict(inputs=xx, target=xx)

  • batch = [inputs, target]

See code below for extra explanation.

It is more general in usage than BasicBatchProcessor: batch and model outputs can be in any format, but note that if batch is a tuple it is treated as containing multiple batches.

It is hardware independent and runs on CPU (device is None) or GPU (device is a GPU id).

It is suitable for training (need_grad_update) and validation (not need_grad_update).

Parameters
  • need_grad_update – Whether gradient update is needed: True for training, False for validation.

  • batch_transforms – Config of batch transforms.

  • loss_collector – A callable object used to collect loss Tensors in model outputs.

  • enable_amp – Whether to train with Automatic Mixed Precision.

  • enable_apex – Whether to train with Apex.

  • delay_sync – Whether to delay gradient synchronization when training with DDP. Refer to the DDP.no_sync() API.
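
A sketch of the two batch formats distinguished by this processor; the tensors and task names are placeholders, and how the model consumes each sub-batch is up to the model itself.

# Sketch only: a tuple means multiple batches within one training step,
# anything else is treated as a single batch.
import torch
from hat.engine.processors import MultiBatchProcessor, collect_loss_by_regex

multi_batch = (
    dict(inputs=torch.rand(4, 3, 32, 32), task="vehicle"),   # batch 1
    dict(inputs=torch.rand(4, 3, 32, 32), task="person"),    # batch 2
)
single_batch = dict(inputs=torch.rand(4, 3, 32, 32),
                    target=torch.zeros(4, dtype=torch.long))

processor = MultiBatchProcessor(
    need_grad_update=True,
    loss_collector=collect_loss_by_regex("^.*loss.*"),  # match loss keys
)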

hat.engine.processors.collect_loss_by_index(indexes: Union[int, Sequence[int]]) → Callable

Collect losses by specific indexes of loss Tensors in model outputs such as (losses, preds), (… loss1, … loss2, …) and so on.

Parameters

indexes – Indexes of loss Tensors in model outputs.

Returns

A function with model outputs as input, return loss Tensors collected by indexes.

Examples:

>>> model_outs = [
...     [torch.tensor(1.0), torch.tensor(2.0)],  # losses
...     [torch.tensor(3.0), torch.tensor(4.0)]   # preds
... ]
>>> collector = collect_loss_by_index(0)
>>> collector(model_outs)
[tensor(1.), tensor(2.)]
hat.engine.processors.collect_loss_by_regex(loss_name_pattern: str) → Callable

Flatten model outputs into an OrderedDict, then use an re regex to match the keys of loss Tensors.

Parameters

loss_name_pattern – re regex, e.g. ‘^.*loss.*’.

Returns

A function with model outputs as input, return loss Tensors matched by loss_name_pattern.

Example:

>>> model_outs = dict(
...     toy_loss_1=torch.tensor(1.0),
...     toy_predict=torch.tensor(2.0),
...     toy_loss_2=torch.tensor(3.0),
... )
>>> collector = collect_loss_by_regex('^.*loss.*')
>>> collector(model_outs)
[tensor(1.), tensor(3.)]