torchvideo.transforms

This module contains video transforms similar to those found in torchvision.transforms, which is specialised for image transformations. As with torchvision.transforms, you can chain successive transforms together using torchvision.transforms.Compose.

Target parameters

All transforms support a target parameter. Currently these don’t do anything beyond passing the target through unchanged, but they allow you to implement transforms on targets as well as frames. In future we intend to support transforming targets such as masks, and to let you plug your own target transforms into these classes.

Examples

Typically your transformation pipelines will be composed of a sequence of PIL video transforms followed by a CollectFrames transform and a PILVideoToTensor transform.

import torchvideo.transforms as VT
import torchvision.transforms as IT
from torchvision.transforms import Compose

transform = Compose([
    VT.CenterCropVideo((224, 224)),  # (h, w)
    VT.CollectFrames(),
    VT.PILVideoToTensor()
])

Optical flow stored as flattened \((u, v)\) pairs like \((u_0, v_0, u_1, v_1, \ldots, u_n, v_n)\) that are then stacked into the channel dimension would be dealt with like so:

import torchvideo.transforms as VT
import torchvision.transforms as IT
from torchvision.transforms import Compose

transform = Compose([
    VT.CenterCropVideo((224, 224)),  # (h, w)
    VT.CollectFrames(),
    VT.PILVideoToTensor(),
    VT.TimeToChannel()
])

Video Datatypes

torchvideo represents videos in a variety of formats:

  • PIL video: A list of PIL Images; this is useful for applying image data augmentations

  • tensor video: A torch.Tensor of shape \((C, T, H, W)\) for feeding a network.

  • NDArray video: A numpy.ndarray of shape either \((T, H, W, C)\) or \((C, T, H, W)\). The reason for the multiple channel shapes is that most loaders load in \((T, H, W, C)\) format, however tensors formatted for input into a network are typically in \((C, T, H, W)\) format. Permuting the dimensions is a costly operation, so supporting both formats allows for efficient implementation of transforms without having to convert back and forth between them.
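As a rough sketch of how these representations fit together (using the conversion transforms documented below; the frame and crop sizes are arbitrary example values), an NDArray video can be converted to a PIL video, transformed, and collected into a tensor video:

import numpy as np
import torchvideo.transforms as VT

# A dummy 8-frame RGB video in the (T, H, W, C) layout most loaders produce.
video = np.random.randint(0, 255, size=(8, 128, 171, 3), dtype=np.uint8)

transform = VT.Compose([
    VT.NDArrayToPILVideo(format="thwc"),  # NDArray video -> PIL video (iterator)
    VT.CenterCropVideo((112, 112)),       # PIL video transform
    VT.CollectFrames(),                   # materialise the iterator into a list
    VT.PILVideoToTensor()                 # list of PIL Images -> (C, T, H, W) tensor video
])

clip = transform(video)  # torch.Tensor of shape (3, 8, 112, 112)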

Composing Transforms

Transforms can be composed with Compose. This functions in exactly the same way as torchvision’s implementation, except that it also supports chaining transforms that require, optionally support, or don’t support a target parameter. It handles marshalling targets into and around those transforms depending upon their support, allowing you to mix transforms defined in this library (all of which support a target parameter) with those defined in other libraries.
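As an illustrative sketch (the label value and the frame-subsampling lambda are made up for this example, and we assume here that a target passed to the composed transform is returned alongside the transformed frames):

from PIL import Image
import torchvideo.transforms as VT

transform = VT.Compose([
    VT.ResizeVideo(256),             # accepts an optional target
    VT.CenterCropVideo((224, 224)),  # accepts an optional target
    VT.CollectFrames(),
    lambda frames: frames[::2],      # plain callable with no target support
    VT.PILVideoToTensor()
])

frames = [Image.new("RGB", (320, 240)) for _ in range(8)]
# Assumption: the target (here the label 3) is marshalled around the
# target-unaware lambda and returned together with the frames.
clip, label = transform(frames, 3)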

Additionally, we provide an IdentityTransform that has a nicer __repr__, suitable for use as a default transform in Compose pipelines.
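For instance, a sketch of conditionally enabling augmentation (the train flag and choice of transforms are illustrative):

import torchvideo.transforms as VT

def make_transform(train):
    # IdentityTransform is a readable no-op when augmentation is disabled.
    augment = VT.RandomHorizontalFlipVideo(p=0.5) if train else VT.IdentityTransform()
    return VT.Compose([
        VT.ResizeVideo(256),
        augment,
        VT.CenterCropVideo((224, 224)),
        VT.CollectFrames(),
        VT.PILVideoToTensor()
    ])

print(make_transform(train=False))  # the repr makes the no-op explicit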

Compose

class torchvideo.transforms.Compose(transforms)[source]

Bases: object

Similar to torchvision.transforms.transforms.Compose except supporting transforms that take either a mandatory or optional target parameter in __call__. This facilitates chaining a mix of transforms: those that don’t support target parameters, those that do, and those that require them.

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)[source]

Call self as a function.

IdentityTransform

class torchvideo.transforms.IdentityTransform[source]

Bases: torchvideo.transforms.transforms.transform.StatelessTransform

Identity transformation that returns frames (and labels) unchanged. This is primarily of use when conditionally adding in transforms and you want to default to a transform that doesn’t do anything. Whilst you could just use an identity lambda this transform has a nicer repr that shows that no transform is taking place.

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.


Transforms on PIL Videos

These transforms all take an iterator/iterable of PIL.Image.Image and produce an iterator of PIL.Image.Image. To materialize the iterator, you should compose your sequence of PIL video transforms with CollectFrames.

CenterCropVideo

class torchvideo.transforms.CenterCropVideo(size)[source]

Bases: torchvideo.transforms.transforms.transform.Transform

Crops the given video (composed of PIL Images) at the center of the frame.

Parameters

size (sequence or int) – Desired output size of the crop. If size is an int instead of sequence like (h, w), a square crop (size, size) is made.

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.

RandomCropVideo

class torchvideo.transforms.RandomCropVideo(size, padding=None, pad_if_needed=False, fill=0, padding_mode='constant')[source]

Bases: torchvideo.transforms.transforms.transform.Transform

Crop the given Video (composed of PIL Images) at a random location.

Parameters
  • size (Union[Tuple[int, int], int]) – Desired output size of the crop. If size is an int instead of sequence like (h, w), a square crop (size, size) is made.

  • padding (Union[Tuple[int, int, int, int], Tuple[int, int], None]) – Optional padding on each border of the image. Default is None, i.e no padding. If a sequence of length 4 is provided, it is used to pad left, top, right, bottom borders respectively. If a sequence of length 2 is provided, it is used to pad left/right, top/bottom borders, respectively.

  • pad_if_needed (bool) – Whether to pad the image if smaller than the desired size to avoid raising an exception.

  • fill (int) – Pixel fill value for constant fill. If a tuple of length 3, it is used to fill R, G, B channels respectively. This value is only used when the padding_mode is 'constant'.

  • padding_mode (str) –

    Type of padding. Should be one of: 'constant', 'edge', 'reflect' or 'symmetric'.

    • 'constant': pads with a constant value, this value is specified with fill.

    • 'edge': pads with the last value on the edge of the image.

    • 'reflect': pads with reflection of image (without repeating the last value on the edge) padding [1, 2, 3, 4] with 2 elements on both sides in reflect mode will result in [3, 2, 1, 2, 3, 4, 3, 2].

    • 'symmetric': pads with reflection of image (repeating the last value on the edge) padding [1, 2, 3, 4] with 2 elements on both sides in symmetric mode will result in [2, 1, 1, 2, 3, 4, 4, 3].
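A brief sketch of padding before cropping (the pad amounts and crop size are arbitrary example values):

import torchvideo.transforms as VT

transform = VT.Compose([
    # Pad 8px left/right and 4px top/bottom with black before cropping, and
    # pad further if a frame is still smaller than the 224x224 crop.
    VT.RandomCropVideo(
        (224, 224),
        padding=(8, 4),
        pad_if_needed=True,
        fill=0,
        padding_mode='constant'
    ),
    VT.CollectFrames(),
    VT.PILVideoToTensor()
])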

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.

RandomHorizontalFlipVideo

class torchvideo.transforms.RandomHorizontalFlipVideo(p=0.5)[source]

Bases: torchvideo.transforms.transforms.transform.Transform

Horizontally flip the given video (composed of PIL Images) randomly with a given probability \(p\).

Parameters

p (float) – probability of the video being flipped.

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.

ResizeVideo

class torchvideo.transforms.ResizeVideo(size, interpolation=2)[source]

Bases: torchvideo.transforms.transforms.transform.StatelessTransform

Resize the input video (composed of PIL Images) to the given size.

Parameters
  • size (sequence or int) – Desired output size. If size is a sequence like (h, w), output size will be matched to this. If size is an int, the smaller edge of the image will be matched to this number, i.e., if height > width, then the image will be rescaled to (size * height / width, size).

  • interpolation (int, optional) – Desired interpolation. Default is PIL.Image.BILINEAR (see PIL.Image.Image.resize() for other options).

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.

MultiScaleCropVideo

class torchvideo.transforms.MultiScaleCropVideo(size, scales=(1, 0.875, 0.75, 0.66), max_distortion=1, fixed_crops=True, more_fixed_crops=True)[source]

Bases: torchvideo.transforms.transforms.transform.Transform

Random crop the input video (composed of PIL Images) at one of the given scales or from a set of fixed crops, then resize to specified size.

Parameters
  • size (sequence or int) – Desired output size. If size is an int instead of sequence like (h, w), a square image (size, size) is made.

  • scales (sequence) – A sequence of floats in the range \([0, 1]\) indicating the scale of the crop to be made.

  • max_distortion (int) – Integer between 0 and len(scales) that controls aspect-ratio distortion. This parameter decides which scales will be combined together when creating crop boxes. A max distortion of 0 means that the crop width/height must come from the same scale, whereas a distortion of 1 means that the crop width/height can come from scales one position apart in the scales sequence, thereby stretching or squishing the frame.

  • fixed_crops (bool) – Whether to use upper right, upper left, lower right, lower left and center crop positions as the list of candidate crop positions instead of those generated from scales and max_distortion.

  • more_fixed_crops (bool) – Whether to add center left, center right, upper center, lower center, upper quarter left, upper quarter right, lower quarter left, lower quarter right crop positions to the list of candidate crop positions that are randomly selected. fixed_crops must be enabled to use this setting.

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.

RandomResizedCropVideo

class torchvideo.transforms.RandomResizedCropVideo(size, scale=(0.08, 1.0), ratio=(0.75, 1.3333333333333333), interpolation=2)[source]

Bases: torchvideo.transforms.transforms.transform.Transform

Crop the given video (composed of PIL Images) to random size and aspect ratio.

A crop of random scale (default: \([0.08, 1.0]\)) of the original size and a random aspect ratio (default: \([3/4, 4/3]\)) of the original aspect ratio is made. This crop is finally resized to the given size. This is popularly used to train the Inception networks.

Parameters
  • size (Union[Tuple[int, int], int]) – Desired output size. If size is an int instead of sequence like (h, w), a square image (size, size) is made.

  • scale (Tuple[float, float]) – Range of the crop size relative to the original size.

  • ratio (Tuple[float, float]) – Range of the crop aspect ratio relative to the original aspect ratio.

  • interpolation – Default: PIL.Image.BILINEAR (see PIL.Image.Image.resize() for other options).

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.

TimeApply

class torchvideo.transforms.TimeApply(img_transform)[source]

Bases: torchvideo.transforms.transforms.transform.StatelessTransform

Apply a PIL Image transform across time.

See torchvision.transforms for suitable deterministic transforms to use with this meta-transform.

Warning

You should only use this with deterministic image transforms. Using a transform like torchvision.transforms.RandomCrop will randomly crop each individual frame at a different location producing a nonsensical video.

Parameters

img_transform (Callable[[Image], Image]) – Image transform operating on a PIL Image.
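For example, a sketch applying torchvision.transforms.Grayscale (a deterministic image transform) to every frame:

import torchvideo.transforms as VT
import torchvision.transforms as IT

transform = VT.Compose([
    VT.TimeApply(IT.Grayscale(num_output_channels=3)),  # same deterministic op per frame
    VT.CollectFrames(),
    VT.PILVideoToTensor()
])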

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.


Transforms on torch.*Tensor videos

These transforms are applicable to torch.*Tensor videos only. The input to these transforms should be a tensor of shape \((C, T, H, W)\).

NormalizeVideo

class torchvideo.transforms.NormalizeVideo(mean, std, channel_dim=0, inplace=False)[source]

Bases: torchvideo.transforms.transforms.transform.Transform

Normalise torch.*Tensor \(t\) given mean: \(M = (\mu_1, \ldots, \mu_n)\) and std: \(\Sigma = (\sigma_1, \ldots, \sigma_n)\): \(t'_c = \frac{t_c - M_c}{\Sigma_c}\)

Parameters
  • mean (Union[Sequence[Number], Number]) – Sequence of means for each channel, or a single mean applying to all channels.

  • std (Union[Sequence[Number], Number]) – Sequence of standard deviations for each channel, or a single standard deviation applying to all channels.

  • channel_dim (int) – Index of channel dimension. 0 for 'CTHW' tensors and 1 for 'TCHW' tensors.

  • inplace (bool) – Whether or not to perform the operation in place without allocating a new tensor.
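A minimal sketch; the mean and standard deviation below are placeholder values, substitute your dataset’s per-channel statistics:

import torch
import torchvideo.transforms as VT

# Placeholder per-channel statistics.
normalize = VT.NormalizeVideo(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

clip = torch.rand(3, 16, 112, 112)  # (C, T, H, W) tensor video in [0, 1]
normalized = normalize(clip)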

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.

TimeToChannel

class torchvideo.transforms.TimeToChannel[source]

Bases: torchvideo.transforms.transforms.transform.Transform

Combine the time dimension into the channel dimension by reshaping a video tensor of shape \((C, T, H, W)\) into \((C \times T, H, W)\).
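For example, a sketch folding 5 frames of a 2-channel flow clip into a 10-channel tensor (sizes are arbitrary):

import torch
import torchvideo.transforms as VT

flow = torch.rand(2, 5, 224, 224)  # (C, T, H, W): u/v flow channels over 5 frames
stacked = VT.TimeToChannel()(flow)
print(stacked.shape)  # torch.Size([10, 224, 224])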

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.


Conversion transforms

These transforms are for converting between different video representations. Typically your transformation pipeline will operate on iterators of PIL images which will then be aggregated by CollectFrames and then converted to a tensor via PILVideoToTensor.

CollectFrames

class torchvideo.transforms.CollectFrames[source]

Bases: torchvideo.transforms.transforms.transform.Transform

Collect frames from iterator into list.

Used at the end of a sequence of PIL video transformations.

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.

PILVideoToTensor

class torchvideo.transforms.PILVideoToTensor(rescale=True, ordering='CTHW')[source]

Bases: torchvideo.transforms.transforms.transform.Transform

Convert a list of PIL Images to a tensor \((C, T, H, W)\) or \((T, C, H, W)\).

Parameters
  • rescale (bool) – Whether or not to rescale video from \([0, 255]\) to \([0, 1]\). If False the tensor will be in range \([0, 255]\).

  • ordering (str) – What channel ordering to convert the tensor to. Either ‘CTHW’ or ‘TCHW’
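A brief sketch converting a dummy 8-frame PIL video into a 'TCHW' tensor (the frame size is arbitrary):

from PIL import Image
import torchvideo.transforms as VT

frames = [Image.new("RGB", (224, 224)) for _ in range(8)]
to_tensor = VT.PILVideoToTensor(rescale=True, ordering="TCHW")
clip = to_tensor(frames)
print(clip.shape)  # torch.Size([8, 3, 224, 224])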

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.

NDArrayToPILVideo

class torchvideo.transforms.NDArrayToPILVideo(format='thwc')[source]

Bases: torchvideo.transforms.transforms.transform.Transform

Convert a numpy.ndarray of the format \((T, H, W, C)\) or \((C, T, H, W)\) to a PIL video (an iterator of PIL images).

Parameters

format – dimensional layout of array, one of "thwc" or "cthw"
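For example, a sketch converting a dummy \((C, T, H, W)\) array into a list of PIL frames:

import numpy as np
import torchvideo.transforms as VT

video = np.random.randint(0, 255, size=(3, 8, 112, 112), dtype=np.uint8)  # (C, T, H, W)
frames = list(VT.NDArrayToPILVideo(format="cthw")(video))  # list of 8 PIL Images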

__call__(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)

Call self as a function.


Functional Transforms

Functional transforms give you fine-grained control of the transformation pipeline. As opposed to the transformations above, functional transforms don’t contain a random number generator for their parameters.

normalize

torchvideo.transforms.functional.normalize(tensor, mean, std, channel_dim=0, inplace=False)[source]

Channel-wise normalize a tensor video of shape \((C, T, H, W)\) with mean and standard deviation

See NormalizeVideo for more details.

Parameters
  • tensor (Tensor) – Tensor video of size \((C, T, H, W)\) to be normalized.

  • mean (Sequence) – Sequence of means, \(M\), for each channel \(c\).

  • std (Sequence) – Sequence of standard deviations, \(\Sigma\), for each channel \(c\).

  • channel_dim (int) – Index of channel dimension. 0 for 'CTHW' tensors and 1 for 'TCHW' tensors.

  • inplace (bool) – Whether or not to normalise the tensor without cloning it.

Return type

Tensor

Returns

Channel-wise normalised tensor video, \(t'_c = \frac{t_c - M_c}{\Sigma_c}\)
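A minimal sketch with placeholder statistics:

import torch
import torchvideo.transforms.functional as VF

clip = torch.rand(3, 16, 112, 112)  # (C, T, H, W)
out = VF.normalize(clip, mean=[0.5, 0.5, 0.5], std=[0.25, 0.25, 0.25])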

time_to_channel

torchvideo.transforms.functional.time_to_channel(tensor)[source]

Reshape video tensor of shape \((C, T, H, W)\) into \((C \times T, H, W)\)

Parameters

tensor (Tensor) – Tensor video of size \((C, T, H, W)\)

Return type

Tensor

Returns

Tensor of shape \((C \times T, H, W)\)
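A minimal sketch with arbitrary sizes:

import torch
import torchvideo.transforms.functional as VF

clip = torch.rand(3, 16, 112, 112)  # (C, T, H, W)
flat = VF.time_to_channel(clip)
print(flat.shape)  # torch.Size([48, 112, 112])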