We use the search space recommended by the BERT authors. We run a total of 18 trials, or full training runs, one for each combination of hyperparameters. $L_{2}$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as Adam. Interestingly, we see that weight_decay is the second most important hyperparameter, showing the importance of searching over more hyperparameters. For all the experiments on the proposed method, we use Stochastic Gradient Descent (SGD) with momentum 0.9 and weight decay $1 \times 10^{-4}$. Weight Decay, or $L_{2}$ Regularization, is a regularization technique applied to the weights of a neural network. Instead of just discarding poorly performing trials, we exploit well-performing runs by copying their network weights and hyperparameters and then explore new hyperparameter configurations, while still continuing to train. All of the experiments below are run on a single AWS p3.16xlarge instance, which has 8 NVIDIA V100 GPUs. Taken from "Fixing Weight Decay Regularization in Adam" by Ilya Loshchilov, Frank Hutter. In practice, it's recommended to fine-tune a ViT model that was pre-trained using a large, high-resolution dataset. Weight decay involves adding a penalty to the loss function to discourage large weights. We minimize a loss function comprising both the primary loss function and a penalty on the $L_{2}$ norm of the weights: $$L_{new}\left(w\right) = L_{original}\left(w\right) + \lambda{w^{T}w}$$ where $\lambda$ is a value determining the strength of the penalty (encouraging smaller weights). In the original BERT implementation and in earlier versions of this repo, both LayerNorm.weight and LayerNorm.bias are decayed. PCT is based on Transformer, which has achieved huge success in natural language processing and displays great potential in image processing.
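
The claim that an $L_{2}$ penalty and weight decay coincide for standard SGD once rescaled by the learning rate can be checked numerically. The following is a minimal sketch, not taken from any of the quoted sources: the quadratic `grad_original`, the starting weights, and the values of `lr` and `lam` are all illustrative assumptions. It compares one SGD step on $L_{original}(w) + \lambda w^{T}w$ with one step that shrinks the weights directly by a factor of $2 \cdot \text{lr} \cdot \lambda$.

```python
# Sketch: for plain SGD, an L2 penalty lam * w^T w and a direct
# weight-decay step give the same update once the decay factor is
# rescaled by the learning rate (decay = 2 * lr * lam).

def grad_original(w):
    # Hypothetical primary loss: L_original(w) = sum((w_i - 1)^2)
    return [2.0 * (wi - 1.0) for wi in w]

def sgd_step_l2(w, lr, lam):
    # Gradient of L_new = L_original + lam * w^T w is grad + 2 * lam * w
    g = grad_original(w)
    return [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, g)]

def sgd_step_weight_decay(w, lr, decay):
    # Weight-decay form: shrink the weights, then take the plain gradient step
    g = grad_original(w)
    return [(1.0 - decay) * wi - lr * gi for wi, gi in zip(w, g)]

w0 = [0.5, -2.0, 3.0]   # illustrative starting weights
lr, lam = 0.1, 0.01     # illustrative learning rate and penalty strength

w_l2 = sgd_step_l2(w0, lr, lam)
w_wd = sgd_step_weight_decay(w0, lr, decay=2.0 * lr * lam)

# The two updates agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(w_l2, w_wd))
print(max_diff)
```

The identity relies on the penalty gradient being multiplied by the same learning rate as the loss gradient. An adaptive method such as Adam rescales the combined gradient per coordinate, so the two forms stop being interchangeable there.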