melMass/image shapes.md

AI Framework Tensor Shape Reference Guide

This guide provides a comprehensive reference for tensor shapes across different AI frameworks and data types.
Important: These are common conventions, not enforced rules.

Note:

- Batch dimension (B) is typically the first dimension when present.
- Some frameworks allow flexible dimension ordering through configuration.
- Shape conventions might vary based on specific functions or models within frameworks.
- Many frameworks support both channel-first and channel-last formats with configuration options.
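
As a rough illustration of the channel-first versus channel-last distinction and the leading batch dimension, the sketch below uses NumPy only. The array sizes are arbitrary, and the variable names are made up for this example rather than taken from any framework.

```python
import numpy as np

# Hypothetical 4x6 RGB image in channel-last (H, W, C) layout.
hwc = np.zeros((4, 6, 3), dtype=np.uint8)

# Channel-last -> channel-first: move the channel axis to the front.
chw = np.transpose(hwc, (2, 0, 1))

# Add a leading batch dimension (B=1) to each layout.
bhwc = hwc[np.newaxis, ...]
bchw = chw[np.newaxis, ...]

print(hwc.shape, chw.shape)    # (4, 6, 3) (3, 4, 6)
print(bhwc.shape, bchw.shape)  # (1, 4, 6, 3) (1, 3, 4, 6)
```

Transposing back with `np.transpose(chw, (1, 2, 0))` recovers the original channel-last array.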


Common Notation


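- B: Batch size
- C: Channels
- T: Time steps / Sequence length
- H: Height
- W: Width
- D: Depth
- V: Vertices
- F: Faces (the TensorFlow audio row uses F for frequency bins instead)
- S: Samples
- Fr: Frames
- M: Mel bands (for spectrograms)
- N: Number of points (for point clouds)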

Image Data


| Framework | Image Shape | Batch Shape | Channel Ordering | Notes |
| --- | --- | --- | --- | --- |
| NumPy | (H, W), (H, W, C) | (B, H, W, C) | RGB | C=1/3/4 for gray/RGB/RGBA |
| PyTorch | (C, H, W) | (B, C, H, W) | RGB | C=1/3/4 for gray/RGB/RGBA |
| TensorFlow | (H, W, C) | (B, H, W, C) | RGB | C=1/3/4 for gray/RGB/RGBA |
| Keras | (H, W, C) | (B, H, W, C) | RGB | Same as TF |
| OpenCV | (H, W), (H, W, C) | N/A | BGR | C=3 for BGR, grayscale is 2D |
| PIL | (H, W), (H, W, C) | N/A | RGB | `size` property is (W, H), but array is (H, W, C) |
| Matplotlib | (H, W), (H, W, C) | N/A | RGB | C=1/3/4 for gray/RGB/RGBA |
| Scikit-image | (H, W), (H, W, C) | N/A | RGB | C=1/3/4 for gray/RGB/RGBA |
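
Two conversions come up often when moving images between the libraries above: swapping BGR and RGB channel order, and giving a 2D grayscale array an explicit channel axis. The sketch below is a NumPy-only illustration; `bgr_to_rgb` and `ensure_channel_axis` are hypothetical helper names, not functions from any of the listed libraries.

```python
import numpy as np

def bgr_to_rgb(image: np.ndarray) -> np.ndarray:
    """Reverse the channel axis of an (H, W, 3) array (BGR <-> RGB)."""
    if image.ndim != 3 or image.shape[-1] != 3:
        raise ValueError(f"expected (H, W, 3), got {image.shape}")
    return image[..., ::-1]

def ensure_channel_axis(image: np.ndarray) -> np.ndarray:
    """Turn a 2D grayscale (H, W) array into (H, W, 1); leave 3D arrays as-is."""
    return image[..., np.newaxis] if image.ndim == 2 else image

bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 0] = 255                 # pure blue when read in BGR order
rgb = bgr_to_rgb(bgr)
print(rgb[0, 0].tolist())         # [0, 0, 255]

gray = np.zeros((2, 2), dtype=np.uint8)
print(ensure_channel_axis(gray).shape)  # (2, 2, 1)
```

Note that the slice in `bgr_to_rgb` returns a view with a negative stride; call `np.ascontiguousarray` on the result if a downstream library needs contiguous memory.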


Audio Data


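| Framework | Raw Audio Shape | Spectrogram Shape | Notes |
| --- | --- | --- | --- |
| Librosa | (S,) | (M, Fr) | M=128 mel bands by default |
| Torchaudio | (C, S) | (C, M, Fr) | C=1 for mono, C=2 for stereo, M=128 mel bands by default |
| TensorFlow | (S,) | (Fr, F) | F = frequency bins, `fft_length // 2 + 1` (129 when `fft_length=256`) |
| Soundfile | (S,), (S, C) | N/A | Mono files load as (S,) unless `always_2d=True`; C=2 for stereo |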
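
Because sample-first (S, C) and channel-first (C, S) layouts both appear in the table, a small conversion step is usually needed between loaders. The following is a minimal NumPy sketch; `to_channels_first` and `to_mono` are hypothetical helpers written for this guide, and the 16000-sample buffer is an arbitrary example.

```python
import numpy as np

def to_channels_first(audio: np.ndarray) -> np.ndarray:
    """Convert (S,) or (S, C) audio into channel-first (C, S)."""
    if audio.ndim == 1:
        return audio[np.newaxis, :]   # mono: (S,) -> (1, S)
    if audio.ndim == 2:
        return audio.T                # (S, C) -> (C, S)
    raise ValueError(f"expected 1D or 2D audio, got shape {audio.shape}")

def to_mono(audio_cs: np.ndarray) -> np.ndarray:
    """Average the channels of a (C, S) array, returning (S,)."""
    return audio_cs.mean(axis=0)

stereo_sc = np.zeros((16000, 2), dtype=np.float32)  # sample-first stereo
stereo_cs = to_channels_first(stereo_sc)
mono = to_mono(stereo_cs)
print(stereo_cs.shape, mono.shape)  # (2, 16000) (16000,)
```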


3D Mesh Data


| Framework | Vertices Shape | Faces Shape | Batch Shape | Additional Attributes | Notes |
| --- | --- | --- | --- | --- | --- |
| PyTorch3D | (V, 3) | (F, 3) | (B, V, 3) | Textures: (F, H, W, 3) or (V, 3); Normals: (V, 3); UV coords: (V, 2) | V = vertices, F = faces; supports packed/padded representations |
| Open3D | (V, 3) | (F, 3) | N/A | Vertex normals: (V, 3); Vertex colors: (V, 3); Triangle normals: (F, 3) | Primarily for single mesh operations |
| Trimesh | (V, 3) | (F, 3) | N/A | Vertex normals: (V, 3); Face normals: (F, 3); UV coords: (V, 2) | Focuses on watertight meshes |
| Kaolin | (V, 3) | (F, 3) | (B, V, 3) | Face uvs: (F, 3, 2); Vertex normals: (V, 3); Face normals: (F, 3) | NVIDIA's 3D DL library |
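
Meshes in a batch rarely share the same vertex count, which is why the table mentions packed and padded representations. The sketch below illustrates the general idea in plain NumPy; it is not PyTorch3D's API, and `pad_vertices`, `pack_vertices`, and the toy inputs are assumptions made for this example.

```python
import numpy as np

def pad_vertices(meshes, pad_value=0.0):
    """Stack (V_i, 3) arrays into (B, V_max, 3) plus a (B,) array of true lengths."""
    lengths = np.array([len(v) for v in meshes])
    padded = np.full((len(meshes), lengths.max(), 3), pad_value, dtype=np.float32)
    for i, verts in enumerate(meshes):
        padded[i, : len(verts)] = verts
    return padded, lengths

def pack_vertices(meshes):
    """Concatenate (V_i, 3) arrays into (sum(V_i), 3) plus a per-vertex mesh index."""
    packed = np.concatenate(meshes, axis=0)
    mesh_index = np.repeat(np.arange(len(meshes)), [len(v) for v in meshes])
    return packed, mesh_index

tri = np.zeros((3, 3), dtype=np.float32)   # toy mesh with 3 vertices
quad = np.ones((4, 3), dtype=np.float32)   # toy mesh with 4 vertices

padded, lengths = pad_vertices([tri, quad])
packed, mesh_index = pack_vertices([tri, quad])
print(padded.shape, lengths.tolist())      # (2, 4, 3) [3, 4]
print(packed.shape, mesh_index.tolist())   # (7, 3) [0, 0, 0, 1, 1, 1, 1]
```

Padding gives a regular (B, V, 3) tensor at the cost of wasted entries, while packing avoids waste but needs the index array to recover which mesh each vertex belongs to.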


Video Data


| Framework | Common Shape | Common Batch Shape | Notes |
| --- | --- | --- | --- |
| PyTorch | (C, Fr, H, W) | (B, C, Fr, H, W) | Flexible ordering, commonly channel-first for consistency with image processing |
|  | (Fr, C, H, W) | (B, Fr, C, H, W) | Some models/datasets use frame-first convention |
| TensorFlow | (Fr, H, W, C) | (B, Fr, H, W, C) | Flexible ordering, commonly channel-last for consistency with image processing |
|  | (H, W, C, Fr) | (B, H, W, C, Fr) | Some models use different ordering for specific architectures |
| Torchvision | (C, Fr, H, W) | (B, C, Fr, H, W) | Typically follows PyTorch's channel-first convention, but transformations can modify this |
| OpenCV | (H, W, C) | N/A | Returns individual frames in BGR (C=3). VideoCapture reads frame by frame |
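
When frames arrive one at a time as (H, W, C) arrays, they have to be stacked and reordered to match one of the clip layouts above. This NumPy sketch uses synthetic zero-filled frames in place of real decoded video; the frame count and resolution are arbitrary.

```python
import numpy as np

# Hypothetical clip: 8 frames, each a 32x48 image with 3 channels in (H, W, C).
frames = [np.zeros((32, 48, 3), dtype=np.uint8) for _ in range(8)]

# Stack along a new leading axis to get frame-first, channel-last (Fr, H, W, C).
clip_fhwc = np.stack(frames, axis=0)

# Reorder to the two channel-first variants listed in the table.
clip_cfhw = np.transpose(clip_fhwc, (3, 0, 1, 2))  # (C, Fr, H, W)
clip_fchw = np.transpose(clip_fhwc, (0, 3, 1, 2))  # (Fr, C, H, W)

print(clip_fhwc.shape)  # (8, 32, 48, 3)
print(clip_cfhw.shape)  # (3, 8, 32, 48)
print(clip_fchw.shape)  # (8, 3, 32, 48)
```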


Point Cloud Data


| Framework | Points Shape | Batch Shape | Notes |
| --- | --- | --- | --- |
| PyTorch3D | (N, 3) | (B, N, 3) | N = number of points |
| Open3D | (N, 3) | N/A | xyz coordinates |
| NumPy | (N, 3) | (B, N, 3) | Basic representation |
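
For point clouds that share the same N, batching is a plain stack, and per-cloud operations reduce over the point axis. The sketch below centers each cloud on its centroid using NumPy; the random data and sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four hypothetical clouds of 100 xyz points each, shape (N, 3).
clouds = [rng.normal(size=(100, 3)) for _ in range(4)]

batch = np.stack(clouds, axis=0)                # (B, N, 3)
centroids = batch.mean(axis=1, keepdims=True)   # (B, 1, 3), one centroid per cloud
centered = batch - centroids                    # broadcasts over the N axis

print(batch.shape, centroids.shape, centered.shape)
```

Keeping `keepdims=True` leaves a singleton point axis so the subtraction broadcasts without an explicit reshape.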


Text/NLP Data


| Framework | Text Shape | Batch Shape | Notes |
| --- | --- | --- | --- |
| PyTorch | (T,) | (B, T) | T = sequence length |
| TensorFlow | (T,) | (B, T) | - |
| HuggingFace | (T,) | (B, T) | Often includes attention masks |
| SpaCy | (T,) | N/A | Document objects |
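
Token sequences of different lengths must be padded to a common T before they fit a (B, T) array, and an attention mask records which positions are real. The following NumPy sketch shows the idea; `pad_batch` is a hypothetical helper, the token ids are invented, and `pad_id=0` is an assumption (real tokenizers define their own padding id).

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Pad variable-length id lists to (B, T) and build a matching 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        input_ids[i, : len(seq)] = seq
        attention_mask[i, : len(seq)] = 1   # 1 marks real tokens, 0 marks padding
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids.tolist())   # [[5, 6, 7], [8, 9, 0]]
print(mask.tolist())  # [[1, 1, 1], [1, 1, 0]]
```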