Authors: Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le. Published at the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Description: We present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images.

[Figure: overview of Noisy Student Training. (1) Train a teacher network on labeled ImageNet. (2) Use the teacher to generate pseudo labels on unlabeled images (the JFT dataset). (3) Train an equal-or-larger student network on the combination, with noise such as dropout applied to the student. (4) Put the student back as the new teacher and repeat.]

The inputs to the algorithm are both labeled and unlabeled images, and unlabeled images are abundant on the internet. We call the method self-training with Noisy Student to emphasize the role that noise plays in the method and results. We use EfficientNets [69] as our baseline models because they provide better capacity for more data. First, we run an EfficientNet-B0 trained on ImageNet [69] over the unlabeled images to filter them by prediction confidence; a teacher model is then used to predict pseudo labels on the filtered data. When data augmentation noise is used, the student must ensure that a translated image, for example, has the same category as the non-translated image. We iterate this process by putting back the student as the teacher, and the best model in our experiments is a result of this iterative training of teacher and student, with the student put back as the new teacher to generate new pseudo labels. EfficientNet-L1 is then scaled up from EfficientNet-L0 by increasing width.

Our main results are shown in Table 1. Overall, EfficientNets with Noisy Student provide a much better tradeoff between model size and accuracy than prior works. This shows that it is helpful to train a large model with high accuracy using Noisy Student when small models are needed for deployment. Prior weakly supervised models did not show significant improvements in robustness on ImageNet-A, C and P as we did. Note that our adversarial robustness results are not directly comparable to prior works, since we use a large input resolution of 800x800 and adversarial vulnerability can scale with the input dimension [17, 20, 19, 61]. The ImageNet-C score (mean corruption error) is normalized by AlexNet's error rate so that corruptions with different difficulties lead to scores of a similar scale.
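To make that normalization concrete, here is a small sketch of how a mean corruption error (mCE) can be computed in the ImageNet-C style: each corruption's error is divided by AlexNet's error on the same corruption before averaging, so that hard and easy corruptions contribute on a similar scale. This is an illustrative helper, not code from the paper; the function name and the toy error values are made up.

```python
from typing import Dict, List

def mean_corruption_error(model_errors: Dict[str, List[float]],
                          alexnet_errors: Dict[str, List[float]]) -> float:
    """ImageNet-C style mCE: for each corruption, sum the top-1 error over
    severity levels, normalize by AlexNet's summed error on the same
    corruption, then average over corruptions (reported as a percentage)."""
    ces = []
    for corruption, errs in model_errors.items():
        ref = alexnet_errors[corruption]
        ces.append(sum(errs) / sum(ref))  # AlexNet-normalized corruption error
    return 100.0 * sum(ces) / len(ces)

# Toy usage with made-up error rates (fractions in [0, 1], one per severity).
model = {"gaussian_noise": [0.20, 0.25, 0.30], "fog": [0.15, 0.18, 0.22]}
alexnet = {"gaussian_noise": [0.70, 0.80, 0.90], "fog": [0.60, 0.65, 0.70]}
print(f"mCE: {mean_corruption_error(model, alexnet):.1f}")
```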
On robustness test sets, Noisy Student Training improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2. (ImageNet-A is a dataset of challenging images that reliably cause model performance to substantially degrade.) In this work, we showed that it is possible to use unlabeled images to significantly advance both the accuracy and the robustness of state-of-the-art ImageNet models.

For the data ablation, we start with the 130M unlabeled images and gradually reduce the number of images used. As can be seen from Table 8, the performance stays similar when we reduce the data to 1/16 of the total, which amounts to 8.1M images after duplicating. Noise matters as well: with all noise removed, the accuracy drops from 84.9% to 84.3% in the case with 130M unlabeled images and from 83.9% to 83.2% in the case with 1.3M unlabeled images. Stochastic Depth is a simple yet ingenious way to add noise to the model by bypassing transformations through skip connections.

For training, the learning rate starts at 0.128 for a labeled batch size of 2048 and decays by 0.97 every 2.4 epochs if trained for 350 epochs, or every 4.8 epochs if trained for 700 epochs. For smaller models, we set the batch size of unlabeled images to be the same as the batch size of labeled images.

Noisy Student Training is a semi-supervised learning approach. We train our model using the self-training framework [59], which has three main steps: 1) train a teacher model on labeled images, 2) use the teacher to generate pseudo labels on unlabeled images, and 3) train a student model on the combination of labeled images and pseudo labeled images. Finally, we iterate the process by putting back the student as a teacher to generate new pseudo labels and train a new student.
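To make these three steps and the iteration concrete, below is a minimal toy sketch of the loop using scikit-learn on synthetic data. It is not the paper's implementation: Gaussian input noise stands in for RandAugment, dropout and stochastic depth, hard pseudo labels stand in for the soft labels used for the large models, and the 0.7 confidence threshold is an arbitrary placeholder. Only the control flow (teacher, pseudo labels, noised equal-or-larger student, student becomes the new teacher) mirrors the method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy data: a small labeled set, a large unlabeled set, and a held-out test set.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_lab, y_lab = X[:500], y[:500]            # stand-in for labeled ImageNet
X_unlab = X[500:4000]                      # stand-in for unlabeled JFT images
X_test, y_test = X[4000:], y[4000:]

def noised(data: np.ndarray, std: float = 0.3) -> np.ndarray:
    # Input noise stands in for RandAugment / dropout / stochastic depth.
    return data + rng.normal(scale=std, size=data.shape)

# Step 1: train a teacher on labeled data only (no noise at inference time).
teacher = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

for it in range(3):  # iterate: the student is put back as the next teacher
    # Step 2: the teacher predicts pseudo labels on unlabeled data; keep only
    # confident predictions (a crude form of in-domain filtering).
    probs = teacher.predict_proba(X_unlab)
    confidence, pseudo = probs.max(axis=1), probs.argmax(axis=1)
    keep = confidence > 0.7
    X_pl, y_pl = X_unlab[keep], pseudo[keep]

    # Step 3: train an equal-or-larger student on labeled + pseudo-labeled
    # data, with noise applied to the student's inputs during training.
    X_train = np.vstack([X_lab, X_pl])
    y_train = np.concatenate([y_lab, y_pl])
    student = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=300,
                            random_state=it).fit(noised(X_train), y_train)

    print(f"iter {it}: {keep.sum()} pseudo-labeled examples kept, "
          f"test accuracy {student.score(X_test, y_test):.3f}")
    teacher = student  # the student becomes the teacher for the next round
```

In the real method the student is a larger EfficientNet trained on images, and the data balancing and soft-label details described later replace the crude confidence filter used in this sketch.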
In terms of training schedule, we train the student model for 350 epochs for models larger than EfficientNet-B4, including EfficientNet-L0, L1 and L2, and for 700 epochs for smaller models. Then, by using the improved B7 model as the teacher, we trained an EfficientNet-L0 student model. During the learning of the student, we inject noise such as dropout, stochastic depth and data augmentation via RandAugment into the student, so that the student generalizes better than the teacher. Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. Our procedure went as follows: we train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images, and the algorithm is iterated a few times by treating the student as a teacher to relabel the unlabeled data and training a new student. The main difference between Data Distillation and our method is that we use the noise to weaken the student, which is the opposite of their approach of strengthening the teacher by ensembling.

We find that Noisy Student is better with an additional trick: data balancing. Since a teacher model's confidence on an image can be a good indicator of whether it is an out-of-domain image, we consider the high-confidence images as in-domain images and the low-confidence images as out-of-domain images. Due to duplications, there are only 81M unique images among these 130M images. For simplicity, we experiment with using 1/128, 1/64, 1/32, 1/16 and 1/4 of the whole data by uniformly sampling images from the unlabeled set, though taking the images with the highest confidence leads to better results.

Prior works on weakly-supervised learning [68, 24, 55, 22] require billions of weakly labeled images, such as 3.5B weakly labeled Instagram images, to improve state-of-the-art ImageNet models, and prior studies have shown that computer vision models lack robustness. Our study shows that using unlabeled data improves both accuracy and general robustness; an important contribution of our work was to show that Noisy Student can potentially help address the lack of robustness in computer vision models. Using Noisy Student also makes a much larger impact on accuracy than changing the architecture. The biggest gain is observed on ImageNet-A: our method achieves 3.5x higher accuracy, going from 16.6% top-1 accuracy for the previous state of the art to 74.2%. After testing our model's robustness to common corruptions and perturbations, we also study its performance on adversarial perturbations. Flip probability is the probability that the model changes its top-1 prediction under different perturbations, and ImageNet-P results are reported as the mean flip rate.
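As an illustration of that metric (not code from the paper), the sketch below computes a flip probability from a model's top-1 predictions over perturbation sequences: the fraction of consecutive frames on which the prediction changes, averaged over sequences. The function name and the toy prediction sequences are made up.

```python
from typing import List, Sequence

def flip_probability(pred_sequences: List[Sequence[int]]) -> float:
    """Fraction of consecutive-frame pairs where the top-1 prediction
    changes, pooled over all perturbation sequences."""
    flips, pairs = 0, 0
    for preds in pred_sequences:
        for prev, cur in zip(preds, preds[1:]):
            flips += int(prev != cur)
            pairs += 1
    return flips / max(pairs, 1)

# Toy example: top-1 class indices along two perturbation sequences.
sequences = [
    [3, 3, 3, 5, 5, 3],   # prediction flips twice (3 -> 5, 5 -> 3)
    [7, 7, 7, 7, 7, 7],   # stable prediction, no flips
]
print(f"flip probability: {flip_probability(sequences):.2f}")  # prints 0.20
```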
In all previous experiments, the student's capacity is as large as or larger than the capacity of the teacher model; by contrast, the main use case of knowledge distillation is model compression by making the student model smaller. Noisy Student Training also brings surprising gains on robustness and adversarial benchmarks. However, in the case with 130M unlabeled images, even with the noise function removed, the performance is still improved to 84.3% from 84.0% when compared to the supervised baseline. Soft pseudo labels lead to better performance for low-confidence data. Finally, for classes that have fewer than 130K images, we duplicate some images at random so that each class can have 130K images.
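Below is a small sketch of that balancing step combined with the confidence-based filtering described earlier: low-confidence pseudo labels are dropped, over-represented classes are trimmed (here by keeping the most confident images, an assumption rather than a detail stated above), and under-represented classes are topped up by duplicating images at random. The data layout, threshold and helper name are illustrative placeholders.

```python
import random
from collections import defaultdict
from typing import Dict, List, NamedTuple

class PseudoImage(NamedTuple):
    path: str          # identifier of the unlabeled image
    label: int         # teacher-predicted (pseudo) class
    confidence: float  # teacher's confidence in that class

def balance_pseudo_labels(images: List[PseudoImage],
                          per_class: int = 130_000,
                          min_confidence: float = 0.3,
                          seed: int = 0) -> List[PseudoImage]:
    """Filter out low-confidence (out-of-domain) images, then balance
    classes so that each pseudo class ends up with `per_class` images."""
    rng = random.Random(seed)
    by_class: Dict[int, List[PseudoImage]] = defaultdict(list)
    for img in images:
        if img.confidence >= min_confidence:   # in-domain filtering
            by_class[img.label].append(img)

    balanced: List[PseudoImage] = []
    for label, imgs in by_class.items():
        if len(imgs) > per_class:
            # Trim over-represented classes, keeping the most confident images.
            imgs = sorted(imgs, key=lambda im: im.confidence, reverse=True)
            balanced.extend(imgs[:per_class])
        else:
            # Duplicate random images so the class reaches the target size.
            balanced.extend(imgs)
            balanced.extend(rng.choices(imgs, k=per_class - len(imgs)))
    return balanced
```

Duplication is what makes the effective dataset larger than the number of unique images (e.g., 81M unique images among the 130M used).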
We use the labeled images to train a teacher model using the standard cross entropy loss. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. For labeled images, we use a batch size of 2048 by default and reduce the batch size when we could not fit the model into memory. We apply RandAugment to all EfficientNet baselines, leading to more competitive baselines. We used the version from [47], which filtered the validation set of ImageNet. Noisy Student (B7, L2) means using EfficientNet-B7 as the student and our best model with 87.4% accuracy as the teacher. As a qualitative example, without Noisy Student the model predicts bullfrog for the image shown on the left of the second row, which might result from the black lotus leaf on the water.

Self-training was previously used to improve ResNet-50 from 76.4% to 81.2% top-1 accuracy [76], which is still far from the state-of-the-art accuracy. Consistency-training works constrain model predictions to be invariant to noise injected into the input, hidden states or model parameters. Works based on pseudo labels [37, 31, 60, 1] are similar to self-training, but they also suffer from the same problem as consistency training, since they rely on a model being trained rather than a converged model with high accuracy to generate pseudo labels. Some prior self-training frameworks are highly optimized for videos, e.g., predicting which frame to use in a video, and are not as general as our work. Code for Noisy Student Training is available (this is not an officially supported Google product; paper: https://arxiv.org/abs/1911.04252); the repository includes architecture specifications for the EfficientNets used in the paper and an implementation of Noisy Student Training on SVHN.

In our experiments, we observe that soft pseudo labels are usually more stable and lead to faster convergence, especially when the teacher model has low accuracy.
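As a concrete illustration of the hard versus soft pseudo-label choice, the NumPy sketch below contrasts the two losses on a toy batch: with hard labels the student is trained on the teacher's argmax class, while with soft labels it minimizes cross entropy against the teacher's full predicted distribution, which keeps more information when the teacher is unsure. The shapes and values are toy placeholders, not anything from the paper.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def hard_pseudo_label_loss(student_logits, teacher_probs) -> float:
    """Cross entropy against the teacher's argmax class (hard pseudo labels)."""
    p = softmax(student_logits)
    hard = teacher_probs.argmax(axis=1)
    return float(-np.log(p[np.arange(len(hard)), hard] + 1e-12).mean())

def soft_pseudo_label_loss(student_logits, teacher_probs) -> float:
    """Cross entropy against the teacher's full distribution (soft labels)."""
    log_p = np.log(softmax(student_logits) + 1e-12)
    return float(-(teacher_probs * log_p).sum(axis=1).mean())

# Toy batch: two unlabeled examples, three classes.
teacher_probs = np.array([[0.70, 0.20, 0.10],   # confident teacher prediction
                          [0.40, 0.35, 0.25]])  # low-confidence prediction
student_logits = np.array([[2.0, 0.5, 0.1],
                           [0.3, 0.4, 0.2]])
print("hard-label loss:", round(hard_pseudo_label_loss(student_logits, teacher_probs), 3))
print("soft-label loss:", round(soft_pseudo_label_loss(student_logits, teacher_probs), 3))
```

For a low-confidence example like the second row, the soft target spreads probability mass over several classes instead of committing to a single, possibly wrong, class, which is consistent with the observation above that soft pseudo labels work better for low-confidence data.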
References:
- Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan. Billion-scale semi-supervised learning for image classification.
- Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings.
- Z. Yang, J. Hu, R. Salakhutdinov, and W. W. Cohen. Semi-supervised QA with generative domain-adaptive nets.
- Unsupervised word sense disambiguation rivaling supervised methods. 33rd Annual Meeting of the Association for Computational Linguistics.
- R. Zhai, T. Cai, D. He, C. Dan, K. He, J. Hopcroft, and L. Wang. Adversarially robust generalization just requires more unlabeled data.
- X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer. Proceedings of the IEEE International Conference on Computer Vision.
- Making convolutional networks shift-invariant again.
- X. Zhang, Z. Li, C. Change Loy, and D. Lin. PolyNet: a pursuit of structural diversity in very deep networks.
- X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. Proceedings of the 20th International Conference on Machine Learning (ICML-03).
- Semi-supervised learning literature survey. University of Wisconsin-Madison Department of Computer Sciences.
- B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks.
- Temporal ensembling for semi-supervised learning.
- Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. Workshop on Challenges in Representation Learning, ICML.
- Certainty-driven consistency loss for semi-supervised learning.
- C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy.
- R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk. Improving robustness without sacrificing accuracy with Patch Gaussian augmentation.
- Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang. Smooth neighbors on teacher graphs for semi-supervised learning.
- L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther.
- A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks.
- D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining.
- T. Miyato, S. Maeda, S. Ishii, and M. Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- A. Najafi, S. Maeda, M. Koyama, and T. Miyato. Robustness to adversarial perturbations in learning from incomplete data.
- J. Ngiam, D. Peng, V. Vasudevan, S. Kornblith, Q. V. Le, and R. Pang. Domain adaptive transfer learning with specialist models.
- Robustness properties of Facebook's ResNeXt WSL models.
- Adversarial dropout for supervised and semi-supervised learning.
- Lessons from building acoustic models with a million hours of speech. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille. Deep co-training for semi-supervised image recognition.
- I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: towards omni-supervised learning.
- A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks.
- E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. Proceedings of the AAAI Conference on Artificial Intelligence.
- B. Recht, R. Roelofs, L. Schmidt, and V. Shankar.
- Inception-v4, Inception-ResNet and the impact of residual connections on learning.