Self-training with Noisy Student improves ImageNet classification
Qizhe Xie (1), Minh-Thang Luong (1), Eduard Hovy (2), Quoc V. Le (1)
(1) Google Research, Brain Team; (2) Carnegie Mellon University
{qizhex, thangluong, qvl}@google.com, hovy@cmu.edu
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. (Submitted to arXiv on 11 Nov 2019.)

Abstract: We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Our simple self-training method achieves 87.4% top-1 accuracy on ImageNet, which is 1.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2.

Collecting labeled data is expensive and must be done with great care; unlabeled images, in contrast, are plentiful and can be collected with ease. Noisy Student Training extends the ideas of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. It is based on the self-training framework and consists of four simple steps:

1. Train a classifier on labeled data (the teacher).
2. Use the teacher model to generate pseudo labels on unlabeled images.
3. Train a larger classifier on the combination of labeled and pseudo-labeled images, adding noise (the noisy student).
4. Treat the student as a new teacher and iterate.

On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo-labeled images. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment so that the student generalizes better than the teacher. This is an important difference between our work and prior works on the teacher-student framework, whose main goal is model compression. A minimal code sketch of the training loop follows.
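As a concrete illustration of the four steps above, here is a minimal, runnable sketch of the self-training loop. It uses toy stand-ins of our own choosing rather than the paper's implementation: scikit-learn classifiers instead of EfficientNets, Gaussian input jitter instead of RandAugment/dropout/stochastic depth, a synthetic dataset instead of ImageNet/JFT, and a made-up confidence threshold of 0.8.

```python
# Toy sketch of the Noisy Student self-training loop (steps 1-4 above).
# All models, data, and thresholds here are illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5500, n_features=20, n_informative=10, random_state=0)
X_labeled, y_labeled = X[:500], y[:500]
X_unlabeled = X[500:]  # labels for these samples are hidden from training

# Step 1: train the teacher on labeled data only.
teacher = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

for iteration in range(3):  # Step 4: iterate, promoting each student to teacher.
    # Step 2: the teacher infers pseudo labels on unlabeled data; keep confident ones.
    probs = teacher.predict_proba(X_unlabeled)
    confident = probs.max(axis=1) >= 0.8             # confidence-based filtering
    pseudo_labels = probs.argmax(axis=1)[confident]   # hard pseudo labels

    # Step 3: train an equal-or-larger, noised student on labeled + pseudo-labeled data.
    X_combined = np.vstack([X_labeled, X_unlabeled[confident]])
    y_combined = np.concatenate([y_labeled, pseudo_labels])
    X_noised = X_combined + rng.normal(scale=0.3, size=X_combined.shape)  # noise stand-in
    student = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500,
                            random_state=iteration).fit(X_noised, y_combined)

    teacher = student  # the student becomes the teacher for the next round

print("final student accuracy on labeled data:", student.score(X_labeled, y_labeled))
```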
In the following, we first describe the experiment details used to achieve our results; lastly, we show the results of benchmarking our model on robustness datasets such as ImageNet-A, C and P, as well as under adversarial perturbations.

We study the importance of noise and the effect of several noise methods used in our model. As shown in Table 6, noise such as stochastic depth, dropout and data augmentation plays an important role in enabling the student model to perform better than the teacher. One might argue that the improvements from using noise result from preventing overfitting to the pseudo labels on the unlabeled images; we hypothesize that the improvement can be attributed to SGD, which introduces stochasticity into the training process.

We also compare soft and hard pseudo labels. The results are shown in Figure 4, with the following observation: soft pseudo labels and hard pseudo labels can both lead to great improvements with in-domain unlabeled images, i.e., high-confidence images. We use soft pseudo labels for our experiments unless otherwise specified, though we have also observed that using hard pseudo labels can achieve as good or slightly better results when a larger teacher is used.

After testing our model's robustness to common corruptions and perturbations, we also study its performance under adversarial perturbations.
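To make the soft-versus-hard comparison above concrete, the small NumPy sketch below shows both label formats and the corresponding cross-entropy targets. The teacher and student logits are invented for illustration and are not from the paper.

```python
# Soft pseudo labels keep the teacher's full predicted distribution;
# hard pseudo labels keep only a one-hot vector for the argmax class.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

teacher_logits = np.array([[2.0, 0.5, -1.0],
                           [0.1, 0.0, 0.2]])

soft_labels = softmax(teacher_logits)                              # e.g. [0.79, 0.18, 0.04]
hard_labels = np.eye(teacher_logits.shape[-1])[teacher_logits.argmax(axis=-1)]

# The student is trained with cross-entropy against either target; with soft
# labels this is the distribution-matching objective sum_k -q_k log p_k.
student_probs = softmax(np.array([[1.5, 0.7, -0.5],
                                  [0.0, 0.1, 0.3]]))
soft_loss = -(soft_labels * np.log(student_probs)).sum(axis=-1).mean()
hard_loss = -(hard_labels * np.log(student_probs)).sum(axis=-1).mean()
print(f"soft-label loss {soft_loss:.3f}, hard-label loss {hard_loss:.3f}")
```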
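For the adversarial-perturbation evaluation mentioned above, a common protocol is an FGSM-style attack. The sketch below is a generic illustration on a toy linear softmax model (so the input gradient has a closed form), not the paper's exact setup, which evaluates full EfficientNets; the epsilon value and toy data are assumptions.

```python
# Minimal FGSM-style perturbation in NumPy on a toy linear softmax classifier.
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, n = 10, 64, 256
W = rng.normal(size=(dim, num_classes))        # toy classifier weights
x = rng.normal(size=(n, dim))                  # toy "images"
y = (x @ W).argmax(axis=1)                     # label with the model's own predictions,
                                               # so the clean accuracy starts at 1.0

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Cross-entropy gradient w.r.t. the input for a linear model: (p - onehot) @ W^T
p = softmax(x @ W)
onehot = np.eye(num_classes)[y]
grad_x = (p - onehot) @ W.T

epsilon = 0.1                                   # attack strength (made up)
x_adv = x + epsilon * np.sign(grad_x)           # FGSM step: x + eps * sign(dL/dx)

clean_acc = ((x @ W).argmax(axis=1) == y).mean()
adv_acc = ((x_adv @ W).argmax(axis=1) == y).mean()
print(f"clean accuracy {clean_acc:.3f} -> FGSM accuracy {adv_acc:.3f}")
```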
In our experiments, we also further scale up EfficientNet-B7 and obtain EfficientNet-L0, L1 and L2. EfficientNet-L1 is scaled up from EfficientNet-L0 by increasing width, and the training time of EfficientNet-L2 is around 2.72 times the training time of EfficientNet-L1. In particular, we first perform normal training with a smaller resolution for 350 epochs; then we finetune the model with a larger resolution for 1.5 epochs on unaugmented labeled images. We do not tune these hyperparameters extensively since our method is highly robust to them.

We first report the validation set accuracy on the ImageNet 2012 ILSVRC challenge prediction task, as commonly done in the literature [35, 66, 23, 69] (see also [55]). Overall, EfficientNets with Noisy Student provide a much better tradeoff between model size and accuracy than prior works. We also study whether it is possible to improve performance on small models by using a larger teacher model, since small models are useful when there are constraints on model size and latency in real-world applications. The results also confirm that vision models can benefit from Noisy Student even without iterative training.

We use a much larger corpus of unlabeled images, where some images may not belong to any category in ImageNet: we obtain unlabeled images from the JFT dataset [26, 11], which has around 300M images. We then perform data filtering and balancing on this corpus. Finally, for classes that have fewer than 130K images, we duplicate some images at random so that each class can have 130K images.

In our experiments, we use dropout [63], stochastic depth [29], and data augmentation [14] to noise the student. When data augmentation noise is used, the student must ensure that a translated image, for example, has the same category as a non-translated image.
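The three student-side noise sources above can be sketched as follows. This uses PyTorch and torchvision stand-ins rather than the paper's TensorFlow EfficientNet code; the tiny residual block, drop rate, and survival probability are illustrative choices, not the paper's values.

```python
# Sketch of student noise: RandAugment on the inputs, plus dropout and
# stochastic depth inside a toy residual block. Illustrative only.
import torch
from torch import nn
from torchvision import transforms
from torchvision.ops import StochasticDepth

# Input noise: RandAugment applied only to the student's training images.
student_augment = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),   # data augmentation noise
    transforms.ToTensor(),
])

class NoisyResidualBlock(nn.Module):
    """Toy residual block carrying the two model-noise sources."""
    def __init__(self, channels: int, drop_rate: float = 0.5, survival: float = 0.8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Dropout(p=drop_rate),                   # dropout noise
        )
        # Stochastic depth randomly drops the whole residual branch per sample.
        self.stochastic_depth = StochasticDepth(p=1.0 - survival, mode="row")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.stochastic_depth(self.body(x))

block = NoisyResidualBlock(channels=8)
block.train()                          # noise is active while training the student;
x = torch.randn(4, 8, 32, 32)          # the teacher generates pseudo labels in eval mode
print(block(x).shape)                  # torch.Size([4, 8, 32, 32])
```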
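The filtering and balancing step described above can be sketched roughly as below, assuming the teacher's predicted probabilities are already computed. The confidence threshold, per-class budget, and the choice to keep the most confident images when a class is over budget are assumptions for illustration; the paper balances each class to 130K images.

```python
# Sketch of confidence-based filtering and per-class balancing of pseudo-labeled data.
import numpy as np

rng = np.random.default_rng(0)
num_unlabeled, num_classes, per_class_budget = 100_000, 10, 5_000
teacher_probs = rng.dirichlet(alpha=np.ones(num_classes), size=num_unlabeled)  # toy probs

confidence = teacher_probs.max(axis=1)
pseudo_label = teacher_probs.argmax(axis=1)
keep = confidence >= 0.3                      # filter out low-confidence images

balanced_indices = []
for c in range(num_classes):
    idx = np.flatnonzero(keep & (pseudo_label == c))
    if len(idx) >= per_class_budget:
        # Keep the most confident images when a class is over budget.
        idx = idx[np.argsort(confidence[idx])[::-1][:per_class_budget]]
    elif len(idx) > 0:
        # Duplicate random images so every class reaches the same budget.
        extra = rng.choice(idx, size=per_class_budget - len(idx), replace=True)
        idx = np.concatenate([idx, extra])
    balanced_indices.append(idx)

balanced_indices = np.concatenate(balanced_indices)
print("balanced corpus size:", len(balanced_indices))
```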
Noisy Student's performance improves with more unlabeled data. Whether the model benefits from more unlabeled data depends on the capacity of the model, since a small model can easily saturate while a larger model can benefit from more data. We also study how to effectively use out-of-domain data; for this we sample 1.3M images in confidence intervals.

Prior works have also used unlabeled data to improve adversarial robustness. The main difference between our work and those works is that they directly optimize adversarial robustness on unlabeled data, whereas we show that self-training with Noisy Student improves robustness greatly even without directly optimizing robustness. Other works use the teacher-student setup for a different purpose from ours: to adapt a teacher model on one domain to another. Parthasarathi et al. apply a similar teacher-student approach to build acoustic models from a million hours of speech.

For robustness evaluation, the ImageNet-A test set [25] consists of difficult images that cause significant drops in accuracy for state-of-the-art models; the mapping from its 200 classes to the original ImageNet classes is available online at https://github.com/hendrycks/natural-adv-examples/blob/master/eval.py. We used the version from [47], which filtered the validation set of ImageNet. For ImageNet-C, the score is normalized by AlexNet's error rate so that corruptions with different difficulties lead to scores of a similar scale.
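The mean corruption error (mCE) normalization just described can be computed as in the following sketch. The per-corruption, per-severity error rates below are invented for illustration; the real ImageNet-C evaluation uses 15 corruption types at 5 severity levels.

```python
# Sketch of ImageNet-C mCE: each corruption's summed error is divided by
# AlexNet's summed error on the same corruption before averaging.
import numpy as np

# rows: corruptions, columns: severity levels 1-5 (top-1 error rates in %)
model_err = np.array([[20.0, 25.0, 30.0, 38.0, 45.0],    # e.g. gaussian noise
                      [15.0, 18.0, 22.0, 27.0, 33.0]])   # e.g. motion blur
alexnet_err = np.array([[70.0, 78.0, 85.0, 90.0, 94.0],
                        [65.0, 70.0, 76.0, 82.0, 88.0]])

# Corruption error per corruption, normalized by AlexNet; mCE is the mean
# over corruptions (commonly reported on a 0-100 scale).
ce = model_err.sum(axis=1) / alexnet_err.sum(axis=1)
mce = 100.0 * ce.mean()
print(f"per-corruption CE: {np.round(100 * ce, 1)}, mCE: {mce:.1f}")
```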
The biggest gain is observed on ImageNet-A: our method achieves 3.5x higher accuracy, going from 16.6% of the previous state of the art to 74.2% top-1 accuracy. In contrast, changing architectures or training with weakly labeled data gives modest gains in accuracy, from 4.7% to 16.6%. In the result tables, Noisy Student (B7) means using EfficientNet-B7 for both the student and the teacher; we also list EfficientNet-B7 as a reference.

As can be seen from the figure, our model with Noisy Student makes correct predictions for images under severe corruptions and perturbations such as snow, motion blur and fog, while the model without Noisy Student suffers greatly under these conditions. In the top-left image, for example, the model without Noisy Student ignores the sea lions and mistakenly recognizes a buoy as a lighthouse, while the model with Noisy Student recognizes the sea lions. The model with Noisy Student also makes correct and consistent predictions as images undergo different perturbations, while the model without Noisy Student flips predictions frequently.

Code is available at https://github.com/google-research/noisystudent. The repository contains the architecture specifications for the EfficientNets used in the paper, the scripts used for our ImageNet experiments, and similar scripts to run predictions on unlabeled data, filter and balance data, and train using the filtered data. To apply the method to your own data, train a teacher, use it to predict pseudo labels on the filtered data, and train a noisy student as described above. This is not an officially supported Google product.

References:
O. Chapelle, B. Scholkopf, and A. Zien. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews].
Semi-supervised deep learning with memory. Proceedings of the European Conference on Computer Vision (ECCV).
Xception: deep learning with depthwise separable convolutions.
K. Clark, M. Luong, C. D. Manning, and Q. V. Le. Semi-supervised sequence modeling with cross-view training.
E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAugment: learning augmentation strategies from data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. RandAugment: practical data augmentation with no separate search.
Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov. Good semi-supervised learning that requires a bad GAN.
T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar. Born-again neural networks.
A. Galloway, A. Golubeva, T. Tanay, M. Moussa, and G. W. Taylor. Batch normalization is a cause of adversarial vulnerability.
R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.
J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow. Adversarial spheres.
I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples.
Semi-supervised learning by entropy minimization. Advances in Neural Information Processing Systems.
K. Gu, B. Yang, J. Ngiam, Q. Le, et al. Using videos to evaluate image model robustness.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems.
Temporal ensembling for semi-supervised learning.
Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. Workshop on Challenges in Representation Learning, ICML.
Certainty-driven consistency loss for semi-supervised learning.
C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search.
R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk. Improving robustness without sacrificing accuracy with Patch Gaussian augmentation.
Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang. Smooth neighbors on teacher graphs for semi-supervised learning.
L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary deep generative models.
A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks.
D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining.
T. Miyato, S. Maeda, S. Ishii, and M. Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
A. Najafi, S. Maeda, M. Koyama, and T. Miyato. Robustness to adversarial perturbations in learning from incomplete data.
J. Ngiam, D. Peng, V. Vasudevan, S. Kornblith, Q. V. Le, and R. Pang. Domain adaptive transfer learning with specialist models.
Robustness properties of Facebook's ResNeXt WSL models.
Adversarial dropout for supervised and semi-supervised learning.
Lessons from building acoustic models with a million hours of speech. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille. Deep co-training for semi-supervised image recognition.
I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: towards omni-supervised learning.
A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks.
E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. Proceedings of the AAAI Conference on Artificial Intelligence.
B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet?
Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan. Billion-scale semi-supervised learning for image classification.
Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings.
Z. Yang, J. Hu, R. Salakhutdinov, and W. W. Cohen. Semi-supervised QA with generative domain-adaptive nets.
Unsupervised word sense disambiguation rivaling supervised methods. 33rd Annual Meeting of the Association for Computational Linguistics.
R. Zhai, T. Cai, D. He, C. Dan, K. He, J. Hopcroft, and L. Wang. Adversarially robust generalization just requires more unlabeled data.
X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer. S4L: self-supervised semi-supervised learning. Proceedings of the IEEE International Conference on Computer Vision.
Making convolutional networks shift-invariant again.
X. Zhang, Z. Li, C. Change Loy, and D. Lin. PolyNet: a pursuit of structural diversity in very deep networks.
X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. Proceedings of the 20th International Conference on Machine Learning (ICML-03).
Semi-supervised learning literature survey. University of Wisconsin-Madison Department of Computer Sciences.
B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition.