Selectivity Drives Productivity:
Efficient Dataset Pruning for Enhanced Transfer Learning

Yihua Zhang1,* Yimeng Zhang1,* Aochuan Chen1,* Jinghan Jia1, Jiancheng Liu1,
Gaowen Liu2, Mingyi Hong3, Shiyu Chang4, Sijia Liu1


1Michigan State University, 2Cisco Research, 3University of Minnesota, Twin Cities, 4UC Santa Barbara
* Equal Contribution

Published at NeurIPS 2023



Overview

[TL;DR] We develop two simple but effective dataset pruning methods for transfer learning, covering both supervised and self-supervised settings, and achieve lossless performance on various downstream tasks.

Dataset Pruning for Transfer Learning

Massive data is often considered essential for deep learning applications, but it also incurs significant computational and infrastructure costs. Dataset pruning (DP) has therefore emerged as an effective way to improve data efficiency by identifying and removing redundant training samples without sacrificing performance. In this work, we address the problem of dataset pruning for transfer learning. Ideally, an all-around DP method for transfer learning should have the following properties:

  • Lossless performance: pretraining on the pruned source dataset should lead to lossless or even improved performance on the finetuning task;
  • High efficiency: the pruning operation itself should be efficient to perform;
  • Broad applicability: the method should be applicable to various pretraining settings, such as supervised and unsupervised pretraining.

To the best of our knowledge, the problem of dataset pruning for transfer learning remains open, as previous studies have primarily treated dataset pruning and transfer learning as separate problems.


By contrast, we establish a unified viewpoint that integrates DP with transfer learning and find that existing DP methods are not suitable for the transfer learning paradigm. Revisiting the DP problem through the lens of source-target domain mapping, we then propose two new DP methods, label mapping (LM) for supervised pretraining and feature mapping (FM) for self-supervised pretraining. We demonstrate the effectiveness of our approach on numerous transfer learning tasks, showing that 40% ∼ 80% of the source data classes can be pruned without sacrificing downstream performance, which yields a significant 2 ∼ 5× speed-up during the pretraining stage.

Motivation and Preliminary Study

Why do we perform dataset pruning on the source (pretraining) dataset?
Do the existing dataset pruning methods work for transfer learning?

Not All Source Classes Are Necessary


Some source data can have a harmful influence on downstream performance.

Removing specific source classes can improve transfer learning.

Conventional Dataset Pruning Lacks Effectiveness in Transfer Learning


In transfer learning, conventional SOTA dataset pruning methods do NOT yield significant performance improvements over random pruning.

It is crucial to develop an efficient and effective DP method specifically tailored for transfer learning.


Dataset Pruning for Transfer Learning
Label Mapping & Feature Mapping

This paper proposes two simple yet powerful and efficient DP methods, one for supervised pretraining and one for unsupervised pretraining.

Label Mapping for Supervised Pretraining

  • 01 Rationale behind our design.

    Source data similar to the downstream data tend to contribute more during the transfer process. How to quantitatively measure the relevance between the source and downstream data thus becomes the key question.

  • To quantitatively measure the relevance between the source and downstream data, dataset pruning for transfer learning can be viewed as a "voting" process: each downstream training sample votes for the source training class most similar/relevant to it.

  • For a supervised pretrained classification model, the voting process is as simple as directly feeding the downstream training data into a pretrained surrogate model. The source classes that receive the most downstream votes are considered the most "relevant" ones. See the figure on the right for an illustration.

    The pretrained surrogate model can be less than 1% of the size of the model we aim to train, and it does not need to achieve superior performance on the source dataset for label mapping to work. A minimal code sketch follows this list.
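
Below is a minimal PyTorch-style sketch of the label mapping voting step, assuming a pretrained surrogate classifier over the source classes and a standard downstream data loader; the helper names (`label_mapping_votes`, `select_source_classes`) are illustrative and not the paper's reference implementation.

```python
import torch


@torch.no_grad()
def label_mapping_votes(surrogate, downstream_loader, num_source_classes, device="cuda"):
    """Count, for each source class, how many downstream samples "vote" for it.

    A downstream sample votes for the source class to which the pretrained
    surrogate classifier assigns the highest logit.
    """
    surrogate.eval().to(device)
    votes = torch.zeros(num_source_classes, dtype=torch.long)
    for images, _ in downstream_loader:          # downstream labels are not needed
        logits = surrogate(images.to(device))    # shape: [batch, num_source_classes]
        preds = logits.argmax(dim=1).cpu()       # most relevant source class per sample
        votes += torch.bincount(preds, minlength=num_source_classes)
    return votes


def select_source_classes(votes, keep_ratio=0.5):
    """Keep the top-voted fraction of source classes; the rest are pruned."""
    num_keep = max(1, int(keep_ratio * votes.numel()))
    return torch.topk(votes, k=num_keep).indices.tolist()
```

Pretraining then proceeds only on the source samples whose classes appear in the retained set, followed by standard finetuning on the downstream task.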

Feature Mapping for Unsupervised Pretraining

  • One key difference between unsupervised and supervised learning lies in the absence of data labels or class information. We therefore first cluster the source data, and each cluster serves as the minimum pruning unit as well as the "voting" candidate.

  • 02 Feature mapping through a small surrogate model.

    Unlike in supervised pretraining, the small surrogate model cannot vote directly through classification. Instead, data relevance is obtained by computing feature distances between data points.

    Each downstream training sample votes for the source data cluster whose centroid is closest in the feature space of the pretrained surrogate model. A minimal code sketch follows this list.
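
A minimal sketch of the feature mapping voting step, assuming a pretrained surrogate feature encoder and data loaders that yield (images, labels) batches; k-means from scikit-learn stands in for the clustering step, and all names are illustrative rather than the paper's reference implementation.

```python
import torch
from sklearn.cluster import KMeans


@torch.no_grad()
def extract_features(encoder, loader, device="cuda"):
    """Run the pretrained surrogate encoder over a dataset and stack the features."""
    encoder.eval().to(device)
    feats = [encoder(images.to(device)).cpu() for images, *_ in loader]
    return torch.cat(feats).numpy()


def feature_mapping_votes(encoder, source_loader, downstream_loader, num_clusters=100):
    """Cluster source features, then let each downstream sample vote for the
    source cluster whose centroid is closest in the surrogate feature space."""
    source_feats = extract_features(encoder, source_loader)
    kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(source_feats)

    downstream_feats = extract_features(encoder, downstream_loader)
    nearest = kmeans.predict(downstream_feats)   # closest centroid per downstream sample
    votes = torch.bincount(torch.as_tensor(nearest, dtype=torch.long), minlength=num_clusters)
    return votes, kmeans.labels_                 # per-cluster votes + source-sample-to-cluster map
```

The clusters receiving the most votes form the pruned source subset; as in label mapping, the retained data are then used for (self-supervised) pretraining before downstream finetuning.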

Experiment Highlights

Please go to Full Experiment Results for more results and demonstrations.
  • LM/FM prunes the source dataset at large ratios without performance loss on downstream tasks.

    This figure presents the downstream accuracy of transfer learning versus different pruning ratios. Here, DP is performed with the surrogate model (RN-18) on ImageNet for 8 downstream tasks. We also report the downstream performance without pruning the source dataset (No Prune) as a reference for identifying winning subsets. Both LM and FM outperform the baselines by a substantial margin, highlighting the effectiveness of our proposed methods in achieving substantial dataset pruning without hurting downstream performance.

  • LM/FM improves transfer learning accuracy by identifying ‘winning subsets’.

    This table summarizes the pruning ratios achieved by the best winning subsets identified by different DP methods across all 8 downstream datasets. Both LM and FM remove more than 45% of the source classes without a drop in downstream performance. In contrast, all baselines suffer significant performance degradation once the pruning ratio exceeds 40%.

  • FM demonstrates superior performance in the unsupervised setting.

    This table highlights the effectiveness of FM-based DP in the self-supervised pretraining setup for three representative downstream tasks. The transfer learning accuracy achieved by using FM consistently outperforms baselines in the self-supervised pretraining paradigm. FM can identify winning subsets for transfer learning even in the challenging regime of large pruning ratios, ranging from 50% to 80%.

  • DP enhances the efficiency of source pretraining.

    This table displays the computation time required to obtain the pretrained source model using LM at different pruning ratios. The reported time covers the entire pipeline: surrogate model training (RN-18), the DP process, and source model training (RN-101) on the pruned ImageNet dataset. The runtime of conventional transfer learning on the full ImageNet dataset with RN-101 is listed as a reference. DP substantially improves the efficiency of source training: taking the 5.4 hours required for source training on the full ImageNet dataset as the baseline, LM with a 20% pruned ImageNet achieves a 15% reduction in training time, and the advantage grows to 76% when the pruning ratio reaches 80%. These computational savings come without any sacrifice in transfer learning accuracy.

Full Experiment Results

  • [Main]: experiment highlights to demonstrate the superior performance of the proposed methods;
  • [More]: additional experiments on abundant transfer learning settings;
  • [Ablation]: various ablation studies;
  • [Visualization]: visualizations and analyses to support the quantitative results.

    Main Results on Supervised Pretraining

    In this experiment, we present the main results of LM/FM on 8 downstream datasets in the supervised setting.

    Winning Subsets

    In this experiment, we collect statistics on the winning subsets found by different methods in the supervised pretraining setting and show the superiority of LM/FM.

    Main Results on Unsupervised Pretraining

    In this experiment, we demonstrate the superior performance of FM in the unsupervised setting.

    Loss Landscape Flatness Analysis

    In this experiment, we show that models pretrained on the source subsets found by our method enjoy the flattest loss landscape.

    Feature Distribution Visualization and Analysis

    In this experiment, we visualize and analyze the feature distributions of the source classes selected by FM, showing that the selected classes resemble the downstream data the most.

    Explanation by Example (Part I)

    In this experiment, we visualize and examine the top classes selected by FM together with their image examples, showing that the selected classes resemble the downstream data the most.

    Explanation by Example (Part II)

    In this experiment, we visualize and examine the top classes selected by FM together with their image examples, showing that the selected classes resemble the downstream data the most.

    Explanation by Example (Part III)

    In this experiment, we visualize and examine the top classes selected by FM together with their image examples, showing that the selected classes resemble the downstream data the most.

    Efficiency Improvement

    In this experiment, we show the reduction in time consumption achieved by using label mapping and feature mapping.

    Applying LM to Adversarial Pretraining

    In this experiment, we demonstrate the applicability of our method in the setting of adversarial transfer learning.

    Experiments on ViT

    In this experiment, we demonstrate that our method can be applied to more model architectures, such as ViTs.

    Experiments on Unsupervised Learning with MoCov3

    In this experiment, we demonstrate that our method can be applied to more advanced unsupervised pretraining methods, such as MoCov3.

    Experiments in the Multi-Task Setting

    In this experiment, we present the results of our method in the multi-task setting, namely, pruning the source dataset while considering multiple downstream tasks.

    Experiments with Data Bias

    In this experiment, we demonstrate the effectiveness of our method in the setting with data biases in the source dataset and show that our method can effectively eliminate such biases.

    Experiments in the Few-Shot Setting

    In this experiment, we present results in additional transfer settings, including the few-shot setting.

    Results with Different Surrogate Models

    In this experiment, we show that our methods are robust to the choice of surrogate model.

    Results with Different Surrogate Model Sizes

    In this experiment, we show that our methods can tolerate surrogate models much smaller than the actual target model architecture.

    Results with the Reversely Selected Training Data

    In this experiment, we show that the data selected by label/feature mapping are indeed valuable compared to the data receiving the fewest votes.

    Results of Feature Mapping with Different Cluster Numbers

    In this experiment, we show that feature mapping is not sensitive to the choice of the number of clusters.

    Yihua Zhang

    Co-First Author

    Graduate Student @OPTML-Group

    Yimeng Zhang

    Co-First Author

    Graduate Student @OPTML-Group

    Aochuan Chen

    Co-First Author

    Graduate Student @OPTML-Group

    Jinghan Jia

    Graduate Student @OPTML-Group

    Jiancheng Liu

    Graduate Student @OPTML-Group

    Gaowen Liu

    Cisco Research

    Mingyi Hong

    Associate Professor @UMN

    Shiyu Chang

    Assistant Professor @UCSB

    Sijia Liu

    Assistant Professor @MSU

    PI @OPTML-Group