awesome-metric-learning
😎 Awesome list about practical Metric Learning and its applications
Motivation 🤓
At Qdrant, we have one goal: make metric learning more practical. This listing is in line with that purpose, and we aim to provide a concise yet useful list of awesomeness around metric
learning. It is intended to inspire productive work rather than serve as a full bibliography.
If you find it useful or like it in some other way, you may want to join our Discord server, where we are running a paper reading club on metric learning.
Contributing 🤩
If you want to contribute to this project, but don't know how, you may want to check out the contributing guide (/CONTRIBUTING.md). It's easy! 😌
Surveys 📖
▐ It provides guides for supervised (http://contrib.scikit-learn.org/metric-learn/supervised.html), weakly supervised
▐ (http://contrib.scikit-learn.org/metric-learn/weakly_supervised.html) and unsupervised (http://contrib.scikit-learn.org/metric-learn/unsupervised.html) metric learning algorithms in
▐ the metric_learn (http://contrib.scikit-learn.org/metric-learn/metric_learn.html) package.
- A comprehensive study for newcomers.
▐ Factors such as sampling strategies, distance metrics, and network structures are systematically analyzed by comparing the quantitative results of the methods.
▐ It discusses the need for metric learning, old and state-of-the-art approaches, and some real-world use cases.
Applications 🎮
▐ CLIP offers state-of-the-art zero-shot image classification and image retrieval with a natural language query. See demo
▐ (https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb).
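For a rough sense of the zero-shot workflow, here is a minimal sketch using the Hugging Face Transformers CLIP wrappers; the checkpoint name, image path, and candidate labels are placeholders.

```python
# Minimal zero-shot image classification sketch with CLIP via Hugging Face Transformers.
# The checkpoint name, image path, and candidate labels below are illustrative placeholders.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarity as probabilities
print(dict(zip(labels, probs[0].tolist())))
```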
▐ This work achieves zero-shot classification and cross-modal audio retrieval from natural language queries.
▐ It is an open-class object detector that can detect any label encoded by CLIP without finetuning. See demo (https://huggingface.co/spaces/akhaliq/Detic).
▐ TensorFlow Hub offers a collection of pretrained models from the paper Large Dual Encoders Are Generalizable Retrievers (https://arxiv.org/abs/2112.07899).
▐ GTR models are first initialized from a pre-trained T5 checkpoint. They are then further pre-trained with a set of community question-answer pairs. Finally, they are fine-tuned on the MS
▐ Marco dataset.
▐ The two encoders are shared so the GTR model functions as a single text encoder. The input is variable-length English text and the output is a 768-dimensional vector.
▐ The method and pretrained models found in Flair go beyond zero-shot sequence classification and offer zero-shot span tagging abilities for tasks such as named entity recognition and
▐ part-of-speech tagging.
▐ It leverages HuggingFace Transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. It supports
▐ guided, (semi-)supervised, and dynamic topic modeling with beautiful visualizations.
▐ Identification of substances based on spectral analysis plays a vital role in forensic science. Similarly, the material identification process is of paramount importance for malfunction
▐ reasoning in manufacturing sectors and materials research.
▐ This model identifies materials with deep metric learning applied to X-Ray Diffraction (XRD) spectra. Read this post
▐ (https://towardsdatascience.com/automatic-spectral-identification-using-deep-metric-learning-with-1d-regnet-and-adacos-8b7fb36f2d5f) for more background.
▐ Different from typical information retrieval tasks, code search requires bridging the semantic gap between programming language and natural language to better describe intrinsic
▐ concepts and semantics. The repository provides the pretrained models and source code for Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus
▐ (https://arxiv.org/abs/2201.11313), where they apply several tricks to achieve this.
▐ State-of-the-art methods are incapable of leveraging attributes from different types of items and thus suffer from data sparsity problems because it is quite challenging to represent items
▐ with different feature spaces jointly. To tackle this problem, they propose a kernel-based neural network, namely deep unified representation (DURation) for heterogeneous recommendation, to
▐ jointly model unified representations of heterogeneous items while preserving their original feature space topology structures. See paper (https://arxiv.org/abs/2201.05861).
▐ It provides the implementation of Item2Vec: Neural Item Embedding for Collaborative Filtering (https://arxiv.org/abs/1603.04259), wrapped as a sklearn estimator compatible with GridSearchCV
▐ and BayesSearchCV for hyperparameter tuning.
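As a rough illustration of the Item2Vec idea (not the listed wrapper's own API), one can learn skip-gram embeddings over user interaction sequences with gensim; the item IDs below are hypothetical.

```python
# Illustrative Item2Vec-style training: treat each user's interaction history as a
# "sentence" of item IDs and learn skip-gram embeddings with gensim.
# This sketches the general idea only, not the listed repository's sklearn wrapper.
from gensim.models import Word2Vec

baskets = [
    ["item_1", "item_7", "item_3"],   # hypothetical user sessions
    ["item_2", "item_3", "item_9"],
    ["item_7", "item_1", "item_4"],
]

model = Word2Vec(sentences=baskets, vector_size=64, window=5, min_count=1, sg=1, epochs=20)
print(model.wv.most_similar("item_1", topn=3))  # items co-consumed with item_1
```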
▐ You can search for the overall closest fit, or choose to focus on matching genre, mood, or instrumentation.
▐ It retrieves phrase-level answers to your questions in real time, or passages for downstream tasks. Check out the demo (http://densephrases.korea.ac.kr/), or see the paper
▐ (https://arxiv.org/abs/2109.08133).
▐ Instead of leveraging NLI/XNLI, they make use of the text encoder of the CLIP model, concluding from casual experiments that this sometimes gives better accuracy than NLI-based models.
▐ Application of the SimCLR method to musical data with out-of-domain generalization in million-scale music classification. See demo
▐ (https://spijkervet.github.io/CLMR/examples/clmr-onnxruntime-web/) or paper (https://arxiv.org/abs/2103.09410).
Case Studies ✍️
Libraries 🧰
▐ Quaterion is a framework for fine-tuning similarity learning models. The framework closes the "last mile" problem in training models for semantic search, recommendations, anomaly detection,
▐ extreme classification, matching engines, etc. It is designed to combine the performance of pre-trained models with specialization for the custom task while avoiding slow and costly
▐ training.
- A library for sentence-level embeddings.
▐ Developed on top of the well-known Transformers (https://github.com/huggingface/transformers) library, it provides an easy way to finetune Transformer-based models to obtain sequence-level
▐ embeddings.
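Assuming the library in question is sentence-transformers (or anything API-compatible), typical usage looks like this minimal sketch; the checkpoint name is only an example.

```python
# Minimal sketch of sentence-level embeddings and cosine similarity.
# The model name is an example checkpoint, not a recommendation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["How do I reset my password?", "Password recovery instructions", "Best pizza in town"]
embeddings = model.encode(sentences, convert_to_tensor=True)

scores = util.cos_sim(embeddings[0], embeddings[1:])  # query vs. the rest
print(scores)
```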
▐ The goal of MatchZoo is to provide a high-quality codebase for deep text matching research, such as document retrieval, question answering, conversational response ranking, and paraphrase
▐ identification.
- A metric learning library in TensorFlow with a Keras-like API.
▐ It provides support for self-supervised contrastive learning and state-of-the-art methods such as SimCLR, SimSiam, and Barlow Twins.
▐ A PyTorch library to train and run inference with contextually-keyed word vectors augmented with part-of-speech tags to achieve multi-word queries.
▐ A PyTorch library to efficiently train self-supervised computer vision models with state-of-the-art techniques such as SimCLR, SimSiam, Barlow Twins, and BYOL, among others.
▐ A library that helps you benchmark pretrained and custom embedding models on tens of datasets and tasks with ease.
- A Python implementation of a number of popular recommender algorithms.
▐ It supports incorporating user and item features into traditional matrix factorization. It represents users and items as a sum of the latent representations of their features, thus
▐ achieving better generalization.
▐ It provides efficient multicore and memory-independent implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA),
▐ Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec.
▐ It provides implementations of algorithms such as KNN, LFM, SLIM, NeuMF, FM, DeepFM, VAE and so on, in order to ensure fair comparison of recommender system benchmarks.
Tools ⚒️
▐ It supports UMAP, t-SNE, PCA, or custom techniques to analyze embeddings of encoders.
▐ It allows you to visualize the embedding space by explicitly selecting axes through algebraic formulas on the embeddings (like king - man + woman) and to highlight specific items in the
▐ embedding space. It also supports implicit axes via PCA and t-SNE. See paper (https://arxiv.org/abs/1905.12099).
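The algebraic-axis idea is the classic word-vector arithmetic; a minimal sketch with pretrained GloVe vectors via gensim (the downloaded model name is just an example) looks like this.

```python
# Classic embedding arithmetic ("king - man + woman") with pretrained GloVe vectors via gensim.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # example pretrained model
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Expected to rank "queen" near the top.
```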
Approximate Nearest Neighbors ⚡
▐ It provides benchmarking of 20+ ANN algorithms on nine standard datasets, with support for bringing your own dataset. (Medium Post
▐ (https://medium.com/towards-artificial-intelligence/how-to-choose-the-best-nearest-neighbors-algorithm-8d75d42b16ab?sk=889bc0006f5ff773e3a30fa283d91ee7))
▐ It is not the fastest ANN algorithm but achieves memory efficiency thanks to various quantization and indexing methods such as IVF, PQ, and IVF-PQ. (Tutorial
▐ (https://www.pinecone.io/learn/faiss-tutorial/))
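A minimal IVF-PQ sketch with FAISS; the dataset is random and parameters such as nlist, the number of sub-quantizers, and nprobe are illustrative and should be tuned.

```python
# Build and query an IVF-PQ index with FAISS; all hyperparameters below are illustrative.
import faiss
import numpy as np

d = 128                                              # vector dimensionality
xb = np.random.rand(10_000, d).astype("float32")     # database vectors
xq = np.random.rand(5, d).astype("float32")          # query vectors

quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer for the inverted file
index = faiss.IndexIVFPQ(quantizer, d, 256, 16, 8)   # nlist=256, 16 sub-quantizers, 8 bits each
index.train(xb)                                      # learn coarse centroids and PQ codebooks
index.add(xb)

index.nprobe = 8                                     # number of inverted lists to visit per query
distances, ids = index.search(xq, 5)                 # top-5 approximate neighbors
print(ids)
```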
▐ It is still one of the fastest ANN algorithms out there, albeit with relatively higher memory usage. (Paper: Efficient and robust approximate nearest neighbor search using Hierarchical
▐ Navigable Small World graphs (https://arxiv.org/abs/1603.09320))
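A minimal sketch with hnswlib, one common HNSW implementation; M, ef_construction, and ef are illustrative and should be tuned for your recall/latency target.

```python
# Build and query an HNSW index with hnswlib; M and ef values below are illustrative.
import hnswlib
import numpy as np

dim = 128
data = np.random.rand(10_000, dim).astype("float32")

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=data.shape[0], ef_construction=200, M=16)
index.add_items(data, np.arange(data.shape[0]))

index.set_ef(50)                                     # query-time accuracy/speed trade-off
labels, distances = index.knn_query(data[:3], k=5)   # top-5 neighbors for three queries
print(labels)
```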
▐ Paper: Accelerating Large-Scale Inference with Anisotropic Vector Quantization (https://arxiv.org/abs/1908.10396)
Papers 🔬
Dimensionality Reduction by Learning an Invariant Mapping
▐ Published by Yann LeCun et al. (2005), its main focus was on dimensionality reduction. However, the proposed method has excellent properties for metric learning, such as preserving
▐ neighbourhood relationships and generalizing to unseen data, and it has seen extensive applications and a great number of variations ever since. It is advised that you read this great post
▐ (https://medium.com/@maksym.bekuzarov/losses-explained-contrastive-loss-f8f57fe32246) to better understand its importance for metric learning.
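A minimal PyTorch sketch of the pairwise contrastive loss in the spirit of the paper; the margin value and random inputs are illustrative.

```python
# Pairwise contrastive loss: pull positive pairs together, push negatives beyond a margin.
import torch
import torch.nn.functional as F

def contrastive_loss(x1, x2, y, margin=1.0):
    """y = 1 for similar pairs, y = 0 for dissimilar pairs."""
    d = F.pairwise_distance(x1, x2)                   # Euclidean distance per pair
    positive = y * d.pow(2)                           # pull similar pairs together
    negative = (1 - y) * F.relu(margin - d).pow(2)    # push dissimilar pairs beyond the margin
    return (positive + negative).mean()

x1, x2 = torch.randn(8, 64), torch.randn(8, 64)
y = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(x1, x2, y))
```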
▐ The paper introduces Triplet Loss, which can be seen as the "ImageNet moment" for deep metric learning. It is still one of the state-of-the-art methods and has a great number of
▐ applications in almost any data modality.
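PyTorch ships a ready-made triplet margin loss; a minimal sketch with random embeddings (the margin value is illustrative):

```python
# Triplet loss: the anchor should be closer to the positive than to the negative by a margin.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)
anchor, positive, negative = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64)
print(triplet(anchor, positive, negative))
```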
- A novel loss function with better properties.
▐ It provides scale invariance, robustness against feature variance, and better convergence than Contrastive and Triplet Loss.
▐ Supervised metric learning without pairs or triplets.
▐ Although originally designed for the face recognition task, this loss function achieves state-of-the-art results in many other metric learning problems with simpler and faster data
▐ feeding. It is also robust against unclean and unbalanced data when modified with sub-centers and a dynamic margin.
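A condensed sketch of an ArcFace-style additive angular margin head; the scale s, margin m, and dimensions are illustrative, and production implementations add further numerical-stability tricks.

```python
# ArcFace-style head: cosine logits with an additive angular margin on the target class.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginHead(nn.Module):
    def __init__(self, in_features, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, in_features))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # cosine similarity between L2-normalized embeddings and class weights
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return self.s * logits                        # feed into cross-entropy

head = ArcMarginHead(in_features=64, num_classes=10)
emb, labels = torch.randn(8, 64), torch.randint(0, 10, (8,))
print(F.cross_entropy(head(emb, labels), labels))
```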
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
▐ The paper introduces a method that explicitly avoids the collapse problem in high dimensions with a simple regularization term on the variance of the embeddings along each dimension
▐ individually. This new term can be incorporated into other methods to stabilize training and improve performance.
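A condensed sketch of the three VICReg terms on two batches of embeddings; the loss coefficients follow the commonly cited defaults but are illustrative.

```python
# VICReg-style regularization: invariance (MSE), variance (hinge on per-dimension std),
# and covariance (decorrelate dimensions). Coefficients are illustrative.
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, sim_coef=25.0, var_coef=25.0, cov_coef=1.0):
    n, d = z1.shape
    invariance = F.mse_loss(z1, z2)

    def variance(z):
        std = torch.sqrt(z.var(dim=0) + 1e-4)
        return F.relu(1.0 - std).mean()              # keep per-dimension std above 1

    def covariance(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    return (sim_coef * invariance
            + var_coef * (variance(z1) + variance(z2))
            + cov_coef * (covariance(z1) + covariance(z2)))

z1, z2 = torch.randn(16, 32), torch.randn(16, 32)
print(vicreg_loss(z1, z2))
```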
▐ The paper proposes using the mean centroid representation during training and retrieval for robustness against outliers and more stable features. It further reduces retrieval time and
▐ storage requirements, making it suitable for production deployments.
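The retrieval-time idea can be sketched as nearest-centroid search over per-class mean embeddings; this is a minimal illustration, not the paper's full training objective.

```python
# Nearest-centroid retrieval: represent each class (e.g., a product or identity) by the
# mean of its embeddings, then match queries against centroids instead of all instances.
import numpy as np

def build_centroids(embeddings, labels):
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def retrieve(query, centroids):
    # cosine similarity against every centroid
    scores = {c: np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v))
              for c, v in centroids.items()}
    return max(scores, key=scores.get)

emb = np.random.rand(100, 64).astype("float32")
labels = np.random.randint(0, 10, size=100)
centroids = build_centroids(emb, labels)
print(retrieve(emb[0], centroids))
```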
▐ It demonstrates, among other things, that:
▐ - the composition of data augmentations plays a critical role, with random crop + random color distortion providing the best downstream classifier accuracy,
▐ - introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations,
▐ - and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
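A compact sketch of the NT-Xent objective SimCLR optimizes over a batch of two augmented views; the temperature and embedding sizes are illustrative.

```python
# NT-Xent (normalized temperature-scaled cross-entropy): each sample's two augmented
# views are positives; every other sample in the batch acts as a negative.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # (2N, d) unit-norm projections
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])  # positive indices
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)     # projected views of the same batch
print(nt_xent(z1, z2))
```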
▐ They also incorporate annotated pairs from natural language inference datasets into their contrastive learning framework in a supervised setting, showing that the contrastive learning
▐ objective regularizes the anisotropic space of pre-trained embeddings to be more uniform and better aligns positive pairs when supervised signals are available.
▐ Mining informative negative instances is of central importance to deep metric learning (DML); however, this task is intrinsically limited by mini-batch training, where only a mini-batch of
▐ instances is accessible at each iteration. The paper identifies a "slow drift" phenomenon by observing that embedding features drift exceptionally slowly even as the model parameters are
▐ updated throughout the training process. This suggests that the features of instances computed at preceding iterations can closely approximate the features extracted by the current model.
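The mechanism boils down to a FIFO memory of past embeddings that supplies extra negatives to a pair-based loss; a minimal sketch (queue size is illustrative) could look like this.

```python
# Cross-batch memory sketch: keep a FIFO queue of recent embeddings and labels so the
# loss can mine negatives far beyond the current mini-batch. Queue size is illustrative.
import torch

class EmbeddingMemory:
    def __init__(self, size=4096):
        self.size = size
        self.feats, self.labels = [], []

    def enqueue(self, feats, labels):
        self.feats.append(feats.detach())   # "slow drift" makes slightly stale features usable
        self.labels.append(labels.detach())
        # drop oldest batches once the memory budget is exceeded
        while len(self.feats) > 1 and sum(f.size(0) for f in self.feats) > self.size:
            self.feats.pop(0)
            self.labels.pop(0)

    def get(self):
        return torch.cat(self.feats), torch.cat(self.labels)

memory = EmbeddingMemory(size=1024)
memory.enqueue(torch.randn(32, 128), torch.randint(0, 10, (32,)))
bank_feats, bank_labels = memory.get()       # use as extra negatives in a pair-based loss
print(bank_feats.shape)
```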
Datasets ℹ️
▐ Practitioners can use any labeled or unlabelled data for metric learning with an appropriately chosen method. However, some datasets are particularly important in the literature for
▐ benchmarking or other purposes, and we list them in this section.
- The Stanford Natural Language Inference Corpus, serving as a useful benchmark.
▐ The dataset contains pairs of sentences labeled as contradiction, entailment, or neutral regarding their semantic relationship. It is useful for training semantic search models with metric learning.
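A minimal sketch of loading SNLI pairs with the Hugging Face datasets library; the dataset identifier is the one commonly published on the Hub.

```python
# Load SNLI sentence pairs; labels: 0 = entailment, 1 = neutral, 2 = contradiction.
from datasets import load_dataset

snli = load_dataset("snli", split="train")
snli = snli.filter(lambda ex: ex["label"] != -1)   # drop pairs without a gold label
print(snli[0]["premise"], "||", snli[0]["hypothesis"], "||", snli[0]["label"])
```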
▐ Modeled on the SNLI corpus, the dataset contains sentence pairs from various genres of spoken and written text, and it also offers a distinctive cross-genre generalization evaluation.
▐ Shared as a part of a Kaggle competition by Google, this dataset is more diverse and thus more interesting than the first version.
▐ The dataset consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
▐ The dataset is published along with the "Deep Metric Learning via Lifted Structured Feature Embedding" (https://github.com/rksltnl/Deep-Metric-Learning-CVPR16) paper.
▐ The dataset is published along with the "The 2021 Image Similarity Dataset and Challenge" (http://arxiv.org/abs/2106.09672) paper.