torchvideo.transforms¶
This module contains video transforms similar to those found in torchvision.transforms, which are specialised for image transformations. Like the transforms from torchvision.transforms, you can chain together successive transforms using torchvision.transforms.Compose.
Target parameters¶
All transforms support a target parameter. Currently these don’t do anything, but allow you to implement transforms on targets as well as frames. At some point in future it is the intention that we’ll support transforms of things like masks, or allow you to plug your own target transforms into these classes.
Examples¶
Typically your transformation pipelines will be composed of a sequence of PIL video transforms followed by a CollectFrames transform and a PILVideoToTensor transform.
import torchvideo.transforms as VT
from torchvision.transforms import Compose

transform = Compose([
    VT.CenterCropVideo((224, 224)),  # (h, w)
    VT.CollectFrames(),
    VT.PILVideoToTensor(),
])
Optical flow stored as flattened \((u, v)\) pairs like \((u_0, v_0, u_1, v_1, \ldots, u_n, v_n)\) that are then stacked into the channel dimension would be dealt with like so:
import torchvideo.transforms as VT
from torchvision.transforms import Compose

transform = Compose([
    VT.CenterCropVideo((224, 224)),  # (h, w)
    VT.CollectFrames(),
    VT.PILVideoToTensor(),
    VT.TimeToChannel(),
])
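To see why folding time into channels yields the interleaved layout here, a minimal NumPy sketch of the shape arithmetic follows. It assumes each flow component is loaded as a single-channel frame, so the tensor video has C = 1 and T = 2n. The array names are illustrative and this is not the library's implementation.

```python
import numpy as np

n_pairs, height, width = 3, 4, 5

# Stand-in for a flow clip loaded as grayscale frames in the order
# u_0, v_0, u_1, v_1, ...; frame i is filled with the value i so the
# ordering can be inspected after reshaping.
frames = [np.full((height, width), i, dtype=np.float32) for i in range(2 * n_pairs)]

# Tensor-video layout (C, T, H, W) with a single channel: C = 1, T = 2n.
video = np.stack(frames)[np.newaxis]

# Folding time into channels gives (C * T, H, W) = (2n, H, W).
stacked = video.reshape(-1, height, width)

# Channel index -> original frame index, to check the order is preserved.
channel_order = [int(stacked[c, 0, 0]) for c in range(stacked.shape[0])]
```

With a single input channel the reshape keeps the frame order, so the channels of the result are the flow components in their stored order.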
Video Datatypes¶
torchvideo represents videos in a variety of formats:
PIL video: A list of PIL Images. This is useful for applying image data augmentations.
tensor video: A torch.Tensor of shape \((C, T, H, W)\) for feeding a network.
NDArray video: A numpy.ndarray of shape either \((T, H, W, C)\) or \((C, T, H, W)\). The reason for the multiple channel shapes is that most loaders load in \((T, H, W, C)\) format, however tensors formatted for input into a network are typically formatted in \((C, T, H, W)\). Permuting the dimensions is a costly operation, so supporting both formats allows for efficient implementation of transforms without having to invert the conversion from one format to the other.
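As a rough illustration of the two NDArray layouts, the NumPy sketch below converts a \((T, H, W, C)\) array to \((C, T, H, W)\). The array contents and sizes are made up. In NumPy the transpose itself only returns a strided view; the copying cost arises once a contiguous array is materialised.

```python
import numpy as np

t, h, w, c = 8, 32, 48, 3
thwc = np.zeros((t, h, w, c), dtype=np.uint8)  # layout most loaders produce

# Reorder axes to (C, T, H, W). This returns a strided view: no pixel data moves.
cthw_view = thwc.transpose(3, 0, 1, 2)

# Materialising a contiguous array copies every element.
cthw = np.ascontiguousarray(cthw_view)
```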
Composing Transforms¶
Transforms can be composed with Compose
. This functions in exactly the same
way as torchvision’s implementation, however it also supports chaining transforms
that require, or optionally support, or don’t support a target parameter. It handles
the marshalling of targets around and into those transforms depending upon their
support allowing you to mix transforms defined in this library (all of which support
a target parameter) and those defined in other libraries.
Additionally, we provide an IdentityTransform
that has a nicer __repr__
suitable for use as a default transform in Compose
pipelines.
Compose¶
-
class
torchvideo.transforms.
Compose
(transforms)[source]¶ Bases:
object
Similar to
torchvision.transforms.transforms.Compose
except supporting transforms that take either a mandatory or optional target parameter in __call__. This facilitates chaining a mix of transforms: those that don’t support target parameters, those that do, and those that require them.
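One way such target marshalling could work is sketched below in plain Python. This is not torchvideo's implementation: `compose_with_targets`, `accepts_target` and the toy transforms are hypothetical names, and the sketch simplifies by assuming that any callable with a `target` parameter returns a `(frames, target)` pair.

```python
import inspect


def accepts_target(transform):
    """Return True if the callable has a parameter named ``target``."""
    return "target" in inspect.signature(transform).parameters


def compose_with_targets(transforms):
    """Chain transforms, passing the target only to those that accept it."""
    def run(frames, target):
        for transform in transforms:
            if accepts_target(transform):
                frames, target = transform(frames, target=target)
            else:
                frames = transform(frames)
        return frames, target
    return run


def double_frames(frames):  # frames-only transform
    return [f * 2 for f in frames]


def reverse_with_target(frames, target):  # transform that also maps the target
    return frames[::-1], target[::-1]


pipeline = compose_with_targets([double_frames, reverse_with_target])
out_frames, out_target = pipeline([1, 2, 3], ["a", "b", "c"])
```

Integers and strings stand in for frames and targets here; the point is only that the target bypasses transforms that do not declare it.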
IdentityTransform¶
-
class
torchvideo.transforms.
IdentityTransform
[source]¶ Bases:
torchvideo.transforms.transforms.transform.StatelessTransform
Identity transformation that returns frames (and labels) unchanged. This is primarily of use when conditionally adding in transforms and you want to default to a transform that doesn’t do anything. Whilst you could just use an identity lambda this transform has a nicer repr that shows that no transform is taking place.
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
-
Transforms on PIL Videos¶
These transforms all take an iterator/iterable of PIL.Image.Image
and produce
an iterator of PIL.Image.Image
. To materialize the iterator you should
compose your sequence of PIL video transforms with CollectFrames
.
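The laziness described above can be illustrated without PIL. In this sketch integers stand in for frames and `lazy_brighten` is a hypothetical generator-based transform. Nothing is computed until the iterator is consumed, which is the role CollectFrames plays at the end of a pipeline.

```python
calls = []


def lazy_brighten(frames):
    """Yield transformed frames one at a time instead of building a list."""
    for frame in frames:
        calls.append(frame)  # record when work actually happens
        yield frame + 10


frames = [1, 2, 3]
pending = lazy_brighten(frames)      # no frame has been processed yet
calls_before_collect = list(calls)

collected = list(pending)            # consuming the iterator does the work
```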
CenterCropVideo¶
-
class
torchvideo.transforms.
CenterCropVideo
(size)[source]¶ Bases:
torchvideo.transforms.transforms.transform.Transform
Crops the given video (composed of PIL Images) at the center of the frame.
- Parameters
size (sequence or int) – Desired output size of the crop. If size is an int instead of a sequence like (h, w), a square crop (size, size) is made.
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
RandomCropVideo¶
-
class
torchvideo.transforms.
RandomCropVideo
(size, padding=None, pad_if_needed=False, fill=0, padding_mode='constant')[source]¶ Bases:
torchvideo.transforms.transforms.transform.Transform
Crop the given Video (composed of PIL Images) at a random location.
- Parameters
size (Union[Tuple[int, int], int]) – Desired output size of the crop. If size is an int instead of a sequence like (h, w), a square crop (size, size) is made.
padding (Union[Tuple[int, int, int, int], Tuple[int, int], None]) – Optional padding on each border of the image. Default is None, i.e. no padding. If a sequence of length 4 is provided, it is used to pad left, top, right, bottom borders respectively. If a sequence of length 2 is provided, it is used to pad left/right, top/bottom borders, respectively.
pad_if_needed (bool) – Whether to pad the image if smaller than the desired size to avoid raising an exception.
fill (int) – Pixel fill value for constant fill. If a tuple of length 3, it is used to fill R, G, B channels respectively. This value is only used when the padding_mode is 'constant'.
padding_mode (str) – Type of padding. Should be one of: 'constant', 'edge', 'reflect' or 'symmetric'.
'constant': pads with a constant value, this value is specified with fill.
'edge': pads with the last value on the edge of the image.
'reflect': pads with reflection of image (without repeating the last value on the edge). Padding [1, 2, 3, 4] with 2 elements on both sides in reflect mode will result in [3, 2, 1, 2, 3, 4, 3, 2].
'symmetric': pads with reflection of image (repeating the last value on the edge). Padding [1, 2, 3, 4] with 2 elements on both sides in symmetric mode will result in [2, 1, 1, 2, 3, 4, 4, 3].
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
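The 'reflect' and 'symmetric' examples above can be reproduced with NumPy's `np.pad`, which uses the same mode names. This only illustrates the padding semantics on a 1-D sequence; it makes no claim about how RandomCropVideo pads frames internally.

```python
import numpy as np

row = np.array([1, 2, 3, 4])

constant = np.pad(row, 2, mode="constant", constant_values=0)
edge = np.pad(row, 2, mode="edge")
reflect = np.pad(row, 2, mode="reflect")      # mirror without repeating the edge value
symmetric = np.pad(row, 2, mode="symmetric")  # mirror including the edge value
```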
RandomHorizontalFlipVideo¶
-
class
torchvideo.transforms.
RandomHorizontalFlipVideo
(p=0.5)[source]¶ Bases:
torchvideo.transforms.transforms.transform.Transform
Horizontally flip the given video (composed of PIL Images) randomly with a given probability \(p\).
- Parameters
p (float) – probability of the image being flipped.
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
ResizeVideo¶
-
class
torchvideo.transforms.
ResizeVideo
(size, interpolation=2)[source]¶ Bases:
torchvideo.transforms.transforms.transform.StatelessTransform
Resize the input video (composed of PIL Images) to the given size.
- Parameters
size (sequence or int) – Desired output size. If size is a sequence like (h, w), output size will be matched to this. If size is an int, the smaller edge of the image will be matched to this number, i.e. if height > width, then the image will be rescaled to (size * height / width, size).
interpolation (int, optional) – Desired interpolation. Default is PIL.Image.BILINEAR (see PIL.Image.Image.resize() for other options).
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
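The size rule can be worked through in a few lines. `resized_shape` is a hypothetical helper, not part of torchvideo, and truncating the scaled edge with `int()` is an assumption about rounding.

```python
def resized_shape(height, width, size):
    """Return the (h, w) output shape for an int or (h, w) ``size``."""
    if not isinstance(size, int):
        return tuple(size)              # explicit (h, w): used as-is
    if height > width:
        # width is the smaller edge and is matched to ``size``
        return int(size * height / width), size
    # height is the smaller (or equal) edge
    return size, int(size * width / height)


portrait = resized_shape(640, 480, 240)
landscape = resized_shape(480, 640, 240)
explicit = resized_shape(480, 640, (224, 224))
```

An int size preserves the aspect ratio, while an explicit (h, w) does not.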
MultiScaleCropVideo¶
-
class
torchvideo.transforms.
MultiScaleCropVideo
(size, scales=(1, 0.875, 0.75, 0.66), max_distortion=1, fixed_crops=True, more_fixed_crops=True)[source]¶ Bases:
torchvideo.transforms.transforms.transform.Transform
Random crop the input video (composed of PIL Images) at one of the given scales or from a set of fixed crops, then resize to specified size.
- Parameters
size (sequence or int) – Desired output size. If size is an int instead of a sequence like (h, w), a square image (size, size) is made.
scales (sequence) – A sequence of floats in the range \([0, 1]\) indicating the scale of the crop to be made.
max_distortion (int) – Integer between 0 and len(scales) that controls aspect-ratio distortion. This parameter decides which scales will be combined together when creating crop boxes. A max distortion of 0 means that the crop width/height have to be from the same scale, whereas a distortion of 1 means that the crop width/height can be from 1 scale before or ahead in the scales sequence, thereby stretching or squishing the frame.
fixed_crops (bool) – Whether to use upper right, upper left, lower right, lower left and center crop positions as the list of candidate crop positions instead of those generated from scales and max_distortion.
more_fixed_crops (bool) – Whether to add center left, center right, upper center, lower center, upper quarter left, upper quarter right, lower quarter left, lower quarter right crop positions to the list of candidate crop positions that are randomly selected. fixed_crops must be enabled to use this setting.
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
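How scales and max_distortion combine into candidate crop sizes can be sketched as follows. The function name is hypothetical, and basing the crop sizes on the shorter frame edge is an assumption borrowed from common multi-scale crop implementations rather than something stated above.

```python
def candidate_crop_sizes(height, width, scales=(1, 0.875, 0.75, 0.66), max_distortion=1):
    """Enumerate (crop_h, crop_w) pairs whose scale indices differ by at most max_distortion."""
    base = min(height, width)  # assumption: scales are relative to the shorter edge
    edge_lengths = [int(base * s) for s in scales]
    pairs = []
    for i, crop_h in enumerate(edge_lengths):
        for j, crop_w in enumerate(edge_lengths):
            if abs(i - j) <= max_distortion:
                pairs.append((crop_h, crop_w))
    return pairs


same_scale_only = candidate_crop_sizes(256, 340, max_distortion=0)
neighbouring_scales = candidate_crop_sizes(256, 340, max_distortion=1)
```

With a distortion of 0 only square crops are produced; a distortion of 1 additionally pairs each edge length with its neighbours in the scales sequence.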
RandomResizedCropVideo¶
-
class
torchvideo.transforms.
RandomResizedCropVideo
(size, scale=(0.08, 1.0), ratio=(0.75, 1.3333333333333333), interpolation=2)[source]¶ Bases:
torchvideo.transforms.transforms.transform.Transform
Crop the given video (composed of PIL Images) to random size and aspect ratio.
A crop of random scale (default: \([0.08, 1.0]\)) of the original size and a random scale (default: \([3/4, 4/3]\)) of the original aspect ratio is made. This crop is finally resized to given size. This is popularly used to train the Inception networks.
- Parameters
size (Union[Tuple[int, int], int]) – Desired output size. If size is an int instead of a sequence like (h, w), a square image (size, size) is made.
scale (Tuple[float, float]) – Range of the crop's size relative to the original size.
ratio (Tuple[float, float]) – Range of the crop's aspect ratio relative to the original aspect ratio.
interpolation – Default: PIL.Image.BILINEAR (see PIL.Image.Image.resize() for other options).
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
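A sketch of how one crop box might be sampled from scale and ratio follows, loosely modelled on the approach used by torchvision's RandomResizedCrop: draw an area fraction uniformly and an aspect ratio log-uniformly, retrying when the box does not fit. `sample_crop_box` is a hypothetical helper and the fallback to the full frame is a simplification. For video the box would be sampled once and reused for every frame.

```python
import math
import random


def sample_crop_box(height, width, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3), rng=random, attempts=10):
    """Return (top, left, crop_h, crop_w) for a randomly sized crop inside the frame."""
    area = height * width
    log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
    for _ in range(attempts):
        target_area = area * rng.uniform(*scale)
        aspect = math.exp(rng.uniform(*log_ratio))  # width / height
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            top = rng.randint(0, height - crop_h)
            left = rng.randint(0, width - crop_w)
            return top, left, crop_h, crop_w
    return 0, 0, height, width  # simplification: fall back to the whole frame


rng = random.Random(0)
top, left, crop_h, crop_w = sample_crop_box(240, 320, rng=rng)
```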
TimeApply¶
-
class
torchvideo.transforms.
TimeApply
(img_transform)[source]¶ Bases:
torchvideo.transforms.transforms.transform.StatelessTransform
Apply a PIL Image transform across time.
See torchvision.transforms for suitable deterministic transforms to use with this meta-transform.
Warning
You should only use this with deterministic image transforms. Using a transform like
torchvision.transforms.RandomCrop
will randomly crop each individual frame at a different location producing a nonsensical video.
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
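The warning can be made concrete with a small sketch in which integers stand in for frames. The helper names are hypothetical. A random image transform applied frame by frame draws fresh parameters for each frame, whereas a video transform draws them once and applies them to every frame.

```python
import random


def per_frame_random_offsets(frames, rng, max_offset=100):
    """Per-frame application of a random image transform: one draw per frame."""
    return [rng.randint(0, max_offset) for _ in frames]


def per_video_random_offsets(frames, rng, max_offset=100):
    """Video-aware behaviour: a single draw shared by all frames."""
    offset = rng.randint(0, max_offset)
    return [offset for _ in frames]


frames = list(range(16))
inconsistent = per_frame_random_offsets(frames, random.Random(0))
consistent = per_video_random_offsets(frames, random.Random(0))
```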
-
Transforms on Torch.*Tensor videos¶
These transforms are applicable to torch.*Tensor videos only. The input to these transforms should be a tensor of shape \((C, T, H, W)\).
NormalizeVideo¶
-
class
torchvideo.transforms.
NormalizeVideo
(mean, std, channel_dim=0, inplace=False)[source]¶ Bases:
torchvideo.transforms.transforms.transform.Transform
Normalise torch.*Tensor \(t\) given mean: \(M = (\mu_1, \ldots, \mu_n)\) and std: \(\Sigma = (\sigma_1, \ldots, \sigma_n)\): \(t'_c = \frac{t_c - M_c}{\Sigma_c}\)
- Parameters
mean (Union[Sequence[Number], Number]) – Sequence of means for each channel, or a single mean applying to all channels.
std (Union[Sequence[Number], Number]) – Sequence of standard deviations for each channel, or a single standard deviation applying to all channels.
channel_dim (int) – Index of channel dimension. 0 for 'CTHW' tensors and 1 for 'TCHW' tensors.
inplace (bool) – Whether or not to perform the operation in place without allocating a new tensor.
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
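The normalisation formula with a configurable channel_dim can be sketched in NumPy by reshaping the statistics so they broadcast along the channel axis. `normalize_video` is a hypothetical stand-in operating on arrays, not the library's tensor implementation.

```python
import numpy as np


def normalize_video(video, mean, std, channel_dim=0):
    """Compute (video - mean) / std per channel along ``channel_dim``."""
    n_channels = video.shape[channel_dim]
    # Shape such as (C, 1, 1, 1) or (1, C, 1, 1) so the statistics broadcast.
    stat_shape = [1] * video.ndim
    stat_shape[channel_dim] = n_channels
    mean = np.broadcast_to(np.asarray(mean, dtype=video.dtype), (n_channels,)).reshape(stat_shape)
    std = np.broadcast_to(np.asarray(std, dtype=video.dtype), (n_channels,)).reshape(stat_shape)
    return (video - mean) / std


cthw = np.ones((3, 4, 2, 2), dtype=np.float32)
out_cthw = normalize_video(cthw, mean=[0.5, 1.0, 1.5], std=[0.5, 0.5, 0.5], channel_dim=0)

tchw = np.ones((4, 3, 2, 2), dtype=np.float32)
out_tchw = normalize_video(tchw, mean=1.0, std=2.0, channel_dim=1)  # scalar statistics
```

A scalar mean or std is expanded to one value per channel, mirroring the "single mean applying to all channels" option described above.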
TimeToChannel¶
-
class
torchvideo.transforms.
TimeToChannel
[source]¶ Bases:
torchvideo.transforms.transforms.transform.Transform
Combine time dimension into the channel dimension by reshaping video tensor of shape \((C, T, H, W)\) into \((C \times T, H, W)\)
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
-
Conversion transforms¶
These transforms are for converting between different video representations. Typically
your transformation pipeline will operate on iterators of PIL
images which
will then be aggregated by CollectFrames
and then converted to a tensor via
PILVideoToTensor
.
CollectFrames¶
-
class
torchvideo.transforms.
CollectFrames
[source]¶ Bases:
torchvideo.transforms.transforms.transform.Transform
Collect frames from iterator into list.
Used at the end of a sequence of PIL video transformations.
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
-
PILVideoToTensor¶
-
class
torchvideo.transforms.
PILVideoToTensor
(rescale=True, ordering='CTHW')[source]¶ Bases:
torchvideo.transforms.transforms.transform.Transform
Convert a list of PIL Images to a tensor \((C, T, H, W)\) or \((T, C, H, W)\).
- Parameters
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
NDArrayToPILVideo¶
-
class
torchvideo.transforms.
NDArrayToPILVideo
(format='thwc')[source]¶ Bases:
torchvideo.transforms.transforms.transform.Transform
Convert numpy.ndarray of the format \((T, H, W, C)\) or \((C, T, H, W)\) to a PIL video (an iterator of PIL images).
- Parameters
format – dimensional layout of array, one of "thwc" or "cthw".
-
__call__
(frames, target=<class 'torchvideo.transforms.transforms.transform.empty_target'>)¶ Call self as a function.
Functional Transforms¶
Functional transforms give you fine-grained control of the transformation pipeline. As opposed to the transformations above, functional transforms don’t contain a random number generator for their parameters.
normalize¶
-
torchvideo.transforms.functional.
normalize
(tensor, mean, std, channel_dim=0, inplace=False)[source]¶ Channel-wise normalize a tensor video of shape \((C, T, H, W)\) with mean and standard deviation.
See NormalizeVideo for more details.
- Parameters
tensor (Tensor) – Tensor video of size \((C, T, H, W)\) to be normalized.
mean (Sequence) – Sequence of means, \(M\), for each channel \(c\).
std (Sequence) – Sequence of standard deviations, \(\Sigma\), for each channel \(c\).
channel_dim (int) – Index of channel dimension. 0 for 'CTHW' tensors and 1 for 'TCHW' tensors.
inplace (bool) – Whether to normalise the tensor in place, without cloning it first.
- Return type
- Returns
Channel-wise normalised tensor video, \(t'_c = \frac{t_c - M_c}{\Sigma_c}\)