Ross Wightman (rwightman)


Old 'RA2' hparams for some ResNet models, trained during the transition from the SGD + RandAugment based 'RA' settings to the RMSProp based 'RA2' settings with Mixup added.

The yaml files are for a 2x GPU distributed setup, so adjust accordingly at other GPU counts to keep global batch size / LR equivalence.

rwightman / _README_MobileNetV4.md
Last active December 3, 2025 19:35
MobileNetV4 hparams

MobileNetV4 Hparams

The included yaml files are timm train script configs for training MobileNetV4 models in timm (weights on the HF Hub: https://huggingface.co/collections/timm/mobilenetv4-pretrained-weights-6669c22cda4db4244def9637)

Note the # of GPUs in each config; it needs to be taken into account for global batch size equivalence and LR scaling.

Also note, some models have lr set to a non-null value; this LR is used directly if set. Otherwise, training falls back to lr_base, and the effective rate is calculated from lr_base_size with sqrt scaling according to the global batch size.
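
As a concrete illustration, the fallback works roughly like this (a minimal sketch of the train script behavior; names mirror the config keys, exact details may differ by timm version):

    def resolve_lr(lr, lr_base, lr_base_size, global_batch_size):
        # If lr is set explicitly in the config, it is used as-is.
        if lr is not None:
            return lr
        # Otherwise fall back to lr_base, sqrt-scaled by the ratio of the
        # actual global batch size to the reference lr_base_size.
        return lr_base * (global_batch_size / lr_base_size) ** 0.5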

Models with ix in the tag use an alternative init for the MQA attention projections: xavier (glorot) uniform instead of the efficientnet/mobilenet defaults. This seemed to improve stability of the hybrid models and allowed a larger (closer to 1) beta2 for adam; otherwise the adam beta2 or the LR had to be reduced to avoid instability with the hybrids.
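
The ix init amounts to something like the following hedged sketch; the helper and the matching on 'attn' in module names are illustrative assumptions, not the exact timm implementation:

    import torch.nn as nn

    def init_mqa_projections_xavier(model):
        # Re-init attention projection weights with xavier (glorot) uniform
        # instead of the efficientnet/mobilenet default init.
        # NOTE: matching modules by 'attn' in the name is an assumption here.
        for name, module in model.named_modules():
            if 'attn' in name and isinstance(module, (nn.Linear, nn.Conv2d)):
                nn.init.xavier_uniform_(module.weight)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)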

rwightman / _README_ViT_SBB_Hparams.md
Last active December 3, 2025 19:35
Searching for Better ViT Baselines Hparams

Some hparams related to RegNets (and other nets) in the TPU training series: https://github.com/rwightman/pytorch-image-models/releases/tag/v0.1-tpu-weights

All models were trained on 8x TPUs, so global batch == batch_size * 8

If a weight name says ra3, it means rmsproptf + mixup + cutmix + rand erasing + (usually) lr noise + rand-aug + head dropout + drop path (stochastic depth). The older ra2 scheme was very similar, but had no cutmix, and rand-aug always used normal sampling (mstd0.5 or mstd1.0) for the rand-aug magnitude, whereas ra3 often (not always) uses uniform sampling (mstd101).
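
The sampling difference can be reproduced with timm's rand-aug config strings; a sketch (the magnitude values are examples, not the exact ones used for these weights):

    from timm.data.auto_augment import rand_augment_transform

    hparams = dict(translate_const=100, img_mean=(124, 116, 104))
    # ra2-style: magnitude drawn from a normal around m=9 with std 0.5
    ra2_tfm = rand_augment_transform('rand-m9-mstd0.5', hparams)
    # ra3-style: mstd > 100 switches timm to uniform sampling of the magnitude
    ra3_tfm = rand_augment_transform('rand-m9-mstd101', hparams)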

Some weights were trained with sgd + grad clipping (cx in the name, where x is one of h, 1, 2, 3); h = amped-up augreg.

I believe the 064 regnety was very close with both the ra3 and sgd approaches; the hparams I kept were the sgd ones, but I believe the published weights were rmsproptf, which edged it out by a hair.

rwightman / _timm_hparams.md
Last active December 3, 2025 19:36
Recent timm hparams...

A variety of hparams used to train vit, convnext, vit-hybrids (maxvit, coatnet) recently in timm

All variations on the same theme (DeiT / Swin pretraining) but with different tweaks here and there.

These were all run on 4-8 GPU or TPU devices. They use --lr-base, which rescales the LR automatically based on global batch size (relative to --lr-base-size), so adapting to different GPU counts will work well within a range; running at significantly lower or higher global batch sizes will require re-running a LR search.

More recently, DeiT-III has proven to be a very compelling set of hparams for vit-like models. I've yet to do full runs myself, but their recipe can be adapted to the timm train scripts (3A aug was added recently). https://github.com/facebookresearch/deit/blob/main/README_revenge.md

To use the yaml files directly with the timm train script, pass them via the -c / --config argument.
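
For example, with the repo's distributed launch script (GPU count, config name, and data path here are placeholders, not from the gist):

    ./distributed_train.sh 8 --config maxvit_hparams.yaml --data-dir /path/to/imagenet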

import torch

def unitwise_norm(x):
    # Per-unit parameter/gradient norms, as used by adaptive gradient clipping (AGC).
    if len(x.squeeze().shape) <= 1:  # scalars, biases, norm gains
        dim = None
        keepdim = False
    elif len(x.shape) in (2, 3):  # linear / embedding weights
        dim = 1
        keepdim = True
    elif len(x.shape) == 4:
        dim = (1, 2, 3)  # pytorch convolution kernel is OIHW
        keepdim = True
    else:
        raise ValueError(f'unexpected tensor ndim: {x.ndim}')
    # Reduce to the unit-wise L2 norm over the selected dims.
    if dim is None:
        return torch.sqrt(torch.sum(x ** 2))
    return torch.sqrt(torch.sum(x ** 2, dim=dim, keepdim=keepdim))
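
For context, unitwise_norm is the building block of adaptive gradient clipping (AGC) from the NFNet paper. A minimal sketch of how it is typically applied, loosely following timm.utils.agc (defaults and names here are illustrative):

    def adaptive_clip_grad(parameters, clip_factor=0.01, eps=1e-3):
        # Clip each gradient so its unit-wise norm stays below
        # clip_factor * the unit-wise norm of the corresponding parameter.
        for p in parameters:
            if p.grad is None:
                continue
            g = p.grad.detach()
            max_norm = unitwise_norm(p.detach()).clamp_(min=eps) * clip_factor
            grad_norm = unitwise_norm(g)
            clipped = g * (max_norm / grad_norm.clamp(min=1e-6))
            p.grad.detach().copy_(torch.where(grad_norm < max_norm, g, clipped))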
kastnerkyle / conv_deconv_vae.py
Last active October 19, 2024 08:20
Convolutional Variational Autoencoder, modified from Alec Radford at (https://gist.github.com/Newmu/a56d5446416f5ad2bbac)
# Alec Radford, Indico, Kyle Kastner
# License: MIT
"""
Convolutional VAE in a single file.
Bringing in code from IndicoDataSolutions and Alec Radford (NewMu)
Additionally converted to use default conv2d interface instead of explicit cuDNN
"""
import theano
import theano.tensor as T
from theano.compat.python2x import OrderedDict
roxlu / SSLBuffer.cpp
Created November 2, 2012 11:38
libuv + openssl + SSLBuffer
#include "SSLBuffer.h"
SSLBuffer::SSLBuffer()
  : ssl(NULL)
  , read_bio(NULL)
  , write_bio(NULL)
  , write_to_socket_callback(NULL)
  , write_to_socket_callback_data(NULL)
  , read_decrypted_callback(NULL)
  , read_decrypted_callback_data(NULL)