awesome-sentence-embedding

A curated list of pretrained sentence and word embedding models

Table of Contents

About This Repo

General Framework

Word Embeddings

date paper citation count training code pretrained models
- WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models N/A - RusVectōrēs
2013/01 Efficient Estimation of Word Representations in Vector Space 999+ C Word2Vec
2014/12 Word Representations via Gaussian Embedding 221 Cython -
2014/?? A Probabilistic Model for Learning Multi-Prototype Word Embeddings 127 DMTK -
2014/?? Dependency-Based Word Embeddings 719 C++ word2vecf
2014/?? GloVe: Global Vectors for Word Representation 999+ C GloVe
2015/06 Sparse Overcomplete Word Vector Representations 129 C++ -
2015/06 From Paraphrase Database to Compositional Paraphrase Model and Back 3 Theano PARAGRAM
2015/06 Non-distributional Word Vector Representations 68 Python WordFeat
2015/?? Joint Learning of Character and Word Embeddings 195 C -
2015/?? SensEmbed: Learning Sense Embeddings for Word and Relational Similarity 249 - SensEmbed
2015/?? Topical Word Embeddings 292 Cython
2016/02 Swivel: Improving Embeddings by Noticing What’s Missing 61 TF -
2016/03 Counter-fitting Word Vectors to Linguistic Constraints 232 Python counter-fitting(broken)
2016/05 Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec 91 Chainer -
2016/06 Siamese CBOW: Optimizing Word Embeddings for Sentence Representations 166 Theano Siamese CBOW
2016/06 Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations 58 Go lexvec
2016/07 Enriching Word Vectors with Subword Information 999+ C++ fastText
2016/08 Morphological Priors for Probabilistic Neural Word Embeddings 34 Theano -
2016/11 A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks 359 C++ charNgram2vec
2016/12 ConceptNet 5.5: An Open Multilingual Graph of General Knowledge 604 Python Numberbatch
2016/?? Learning Word Meta-Embeddings 58 - Meta-Emb(broken)
2017/02 Offline bilingual word vectors, orthogonal transformations and the inverted softmax 336 Python -
2017/04 Multimodal Word Distributions 57 TF word2gm
2017/05 Poincaré Embeddings for Learning Hierarchical Representations 413 Pytorch -
2017/06 Context encoders as a simple but powerful extension of word2vec 13 Python -
2017/06 Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints 99 TF Attract-Repel
2017/08 Learning Chinese Word Representations From Glyphs Of Characters 44 C -
2017/08 Making Sense of Word Embeddings 92 Python sensegram
2017/09 Hash Embeddings for Efficient Word Representations 25 Keras -
2017/10 BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages 91 Gensim BPEmb
2017/11 SPINE: SParse Interpretable Neural Embeddings 48 Pytorch SPINE
2017/?? AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP 161 Gensim AraVec
2017/?? Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics 25 C -
2017/?? Dict2vec : Learning Word Embeddings using Lexical Dictionaries 49 C++ Dict2vec
2017/?? Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components 63 C -
2018/04 Representation Tradeoffs for Hyperbolic Embeddings 120 Pytorch h-MDS
2018/04 Dynamic Meta-Embeddings for Improved Sentence Representations 60 Pytorch DME/CDME
2018/05 Analogical Reasoning on Chinese Morphological and Semantic Relations 128 - ChineseWordVectors
2018/06 Probabilistic FastText for Multi-Sense Word Embeddings 39 C++ Probabilistic FastText
2018/09 Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks 3 TF SynGCN
2018/09 FRAGE: Frequency-Agnostic Word Representation 64 Pytorch -
2018/12 Wikipedia2Vec: An Optimized Tool for Learning Embeddings of Words and Entities from Wikipedia 17 Cython Wikipedia2Vec
2018/?? Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings 106 - ChineseEmbedding
2018/?? cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information 45 C++ -
2019/02 VCWE: Visual Character-Enhanced Word Embeddings 5 Pytorch VCWE
2019/05 Learning Cross-lingual Embeddings from Twitter via Distant Supervision 2 Text -
2019/08 An Unsupervised Character-Aware Neural Approach to Word and Context Representation Learning 5 TF -
2019/08 ViCo: Word Embeddings from Visual Co-occurrences 7 Pytorch ViCo
2019/11 Spherical Text Embedding 25 C -
2019/?? Unsupervised word embeddings capture latent knowledge from materials science literature 150 Gensim -
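
Several of the pretrained models in the table above are released as plain-text files with one token per line followed by its vector components (the layout used by the word2vec and GloVe text releases; individual projects may differ). The following is a minimal sketch of reading such a file and finding a nearest neighbour by cosine similarity. The function names and the toy data are hypothetical, and the parser assumes vectors with at least two dimensions so that an optional "vocab_size dim" header line can be told apart from a vector line.

```python
import io
import math


def load_text_vectors(handle):
    """Parse text vectors: each line is a token followed by space-separated floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 3:
            # Skips blank lines and an optional "<vocab_size> <dim>" header
            # (assumption: real vectors have at least two dimensions).
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data standing in for a downloaded file; the values are made up.
TOY_FILE = io.StringIO(
    "3 2\n"
    "king 1.0 0.0\n"
    "queen 0.9 0.1\n"
    "apple 0.0 1.0\n"
)

vectors = load_text_vectors(TOY_FILE)
nearest = max(
    (word for word in vectors if word != "king"),
    key=lambda word: cosine(vectors["king"], vectors[word]),
)
print(nearest)
```

For real files, pass an open file object instead of the `io.StringIO` stand-in. Large vocabularies are usually better served by a dedicated library than by this pure-Python loop.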

OOV Handling

Contextualized Word Embeddings

date paper citation count code pretrained models
- Language Models are Unsupervised Multitask Learners N/A TF; Pytorch, TF2.0; Keras GPT-2(117M, 124M, 345M, 355M, 774M, 1558M)
2017/08 Learned in Translation: Contextualized Word Vectors 524 Pytorch; Keras CoVe
2018/01 Universal Language Model Fine-tuning for Text Classification 167 Pytorch ULMFit(English, Zoo)
2018/02 Deep contextualized word representations 999+ Pytorch; TF ELMo(AllenNLP, TF-Hub)
2018/04 Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling 26 Pytorch LD-Net
2018/07 Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation 120 Pytorch ELMo
2018/08 Direct Output Connection for a High-Rank Language Model 24 Pytorch DOC
2018/10 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 999+ TF; Keras; Pytorch, TF2.0; MXNet; PaddlePaddle; TF; Keras BERT(BERT, ERNIE, KoBERT)
2018/?? Contextual String Embeddings for Sequence Labeling 486 Pytorch Flair
2018/?? Improving Language Understanding by Generative Pre-Training 999+ TF; Keras; Pytorch, TF2.0 GPT
2019/01 Multi-Task Deep Neural Networks for Natural Language Understanding 364 Pytorch MT-DNN
2019/01 BioBERT: pre-trained biomedical language representation model for biomedical text mining 634 TF BioBERT
2019/01 Cross-lingual Language Model Pretraining 639 Pytorch; Pytorch, TF2.0 XLM
2019/01 Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context 754 TF; Pytorch; Pytorch, TF2.0 Transformer-XL
2019/02 Efficient Contextual Representation Learning Without Softmax Layer 2 Pytorch -
2019/03 SciBERT: Pretrained Contextualized Embeddings for Scientific Text 124 Pytorch, TF SciBERT
2019/04 Publicly Available Clinical BERT Embeddings 229 Text clinicalBERT
2019/04 ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission 84 Pytorch ClinicalBERT
2019/05 ERNIE: Enhanced Language Representation with Informative Entities 210 Pytorch ERNIE
2019/05 Unified Language Model Pre-training for Natural Language Understanding and Generation 278 Pytorch UniLMv1(unilm1-large-cased, unilm1-base-cased)
2019/05 HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization 81 - -
2019/06 Pre-Training with Whole Word Masking for Chinese BERT 98 Pytorch, TF BERT-wwm
2019/06 XLNet: Generalized Autoregressive Pretraining for Language Understanding 999+ TF; Pytorch, TF2.0 XLNet
2019/07 ERNIE 2.0: A Continual Pre-training Framework for Language Understanding 107 PaddlePaddle ERNIE 2.0
2019/07 SpanBERT: Improving Pre-training by Representing and Predicting Spans 282 Pytorch SpanBERT
2019/07 RoBERTa: A Robustly Optimized BERT Pretraining Approach 999+ Pytorch; Pytorch, TF2.0 RoBERTa
2019/09 Subword ELMo 1 Pytorch -
2019/09 Knowledge Enhanced Contextual Word Representations 115 - -
2019/09 TinyBERT: Distilling BERT for Natural Language Understanding 129 - -
2019/09 Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism 136 Pytorch Megatron-LM(BERT-345M, GPT-2-345M)
2019/09 MultiFiT: Efficient Multi-lingual Language Model Fine-tuning 29 Pytorch -
2019/09 Extreme Language Model Compression with Optimal Subwords and Shared Projections 32 - -
2019/09 MULE: Multimodal Universal Language Embedding 5 - -
2019/09 Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks 51 - -
2019/09 K-BERT: Enabling Language Representation with Knowledge Graph 59 - -
2019/09 UNITER: Learning UNiversal Image-TExt Representations 60 - -
2019/09 ALBERT: A Lite BERT for Self-supervised Learning of Language Representations 803 TF -
2019/10 BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension 349 Pytorch BART(bart.base, bart.large, bart.large.mnli, bart.large.cnn, bart.large.xsum)
2019/10 DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter 481 Pytorch, TF2.0 DistilBERT
2019/10 Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer 696 TF T5
2019/11 CamemBERT: a Tasty French Language Model 102 - CamemBERT
2019/11 ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations 15 Pytorch -
2019/11 Unsupervised Cross-lingual Representation Learning at Scale 319 Pytorch XLM-R (XLM-RoBERTa)(xlmr.large, xlmr.base)
2020/01 ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training 35 Pytorch ProphetNet(ProphetNet-large-16GB, ProphetNet-large-160GB)
2020/02 CodeBERT: A Pre-Trained Model for Programming and Natural Languages 25 Pytorch CodeBERT
2020/02 UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training 33 Pytorch -
2020/03 ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators 203 TF ELECTRA(ELECTRA-Small, ELECTRA-Base, ELECTRA-Large)
2020/04 MPNet: Masked and Permuted Pre-training for Language Understanding 5 Pytorch MPNet
2020/05 ParsBERT: Transformer-based Model for Persian Language Understanding 1 Pytorch ParsBERT
2020/05 Language Models are Few-Shot Learners 382 - -
2020/07 InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training 12 Pytorch -

Pooling Methods

Encoders

date paper citation count code model_name
- Incremental Domain Adaptation for Neural Machine Translation in Low-Resource Settings N/A Python AraSIF
2014/05 Distributed Representations of Sentences and Documents 999+ Pytorch; Python Doc2Vec
2014/11 Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models 849 Theano; Pytorch VSE
2015/06 Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books 795 Theano; TF; Pytorch, Torch SkipThought
2015/11 Order-Embeddings of Images and Language 354 Theano order-embedding
2015/11 Towards Universal Paraphrastic Sentence Embeddings 411 Theano ParagramPhrase
2015/?? From Word Embeddings to Document Distances 999+ C, Python Word Mover’s Distance
2016/02 Learning Distributed Representations of Sentences from Unlabelled Data 363 Python FastSent
2016/07 Charagram: Embedding Words and Sentences via Character n-grams 144 Theano Charagram
2016/11 Learning Generic Sentence Representations Using Convolutional Neural Networks 76 Theano ConvSent
2017/03 Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features 319 C++ Sent2Vec
2017/04 Learning to Generate Reviews and Discovering Sentiment 293 TF; Pytorch; Pytorch Sentiment Neuron
2017/05 Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings 60 Theano GRAN
2017/05 Supervised Learning of Universal Sentence Representations from Natural Language Inference Data 999+ Pytorch InferSent
2017/07 VSE++: Improving Visual-Semantic Embeddings with Hard Negatives 132 Pytorch VSE++
2017/08 Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm 357 Keras; Pytorch DeepMoji
2017/09 StarSpace: Embed All The Things! 129 C++ StarSpace
2017/10 DisSent: Learning Sentence Representations from Explicit Discourse Relations 47 Pytorch DisSent
2017/11 Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations 128 Theano para-nmt
2017/11 Dual-Path Convolutional Image-Text Embedding with Instance Loss 44 Matlab Image-Text-Embedding
2018/03 An efficient framework for learning sentence representations 183 TF Quick-Thought
2018/03 Universal Sentence Encoder 564 TF-Hub USE
2018/04 End-Task Oriented Textual Entailment via Deep Explorations of Inter-Sentence Interactions 14 Theano DEISTE
2018/04 Learning general purpose distributed sentence representations via large scale multi-task learning 198 Pytorch GenSen
2018/06 Embedding Text in Hyperbolic Spaces 50 TF HyperText
2018/07 Representation Learning with Contrastive Predictive Coding 736 Keras CPC
2018/08 Context Mover’s Distance & Barycenters: Optimal transport of contexts for building representations 8 Python CMD
2018/09 Learning Universal Sentence Representations with Mean-Max Attention Autoencoder 14 TF Mean-MaxAAE
2018/10 Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model 35 TF-Hub USE-xling
2018/10 Improving Sentence Representations with Consensus Maximisation 4 - Multi-view
2018/10 BioSentVec: creating sentence embeddings for biomedical texts 70 Python BioSentVec
2018/11 Word Mover’s Embedding: From Word2Vec to Document Embedding 47 C, Python WordMoversEmbeddings
2018/11 A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks 76 Pytorch HMTL
2018/12 Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond 238 Pytorch LASER
2018/?? Convolutional Neural Network for Universal Sentence Embeddings 6 Theano CSE
2019/01 No Training Required: Exploring Random Encoders for Sentence Classification 54 Pytorch randsent
2019/02 CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model 4 Pytorch CMOW
2019/07 GLOSS: Generative Latent Optimization of Sentence Representations 1 - GLOSS
2019/07 Multilingual Universal Sentence Encoder 52 TF-Hub MultilingualUSE
2019/08 Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks 261 Pytorch Sentence-BERT
2020/02 SBERT-WK: A Sentence Embedding Method By Dissecting BERT-based Word Models 11 Pytorch SBERT-WK
2020/06 DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations 4 Pytorch DeCLUTR
2020/07 Language-agnostic BERT Sentence Embedding 5 TF-Hub LaBSE
2020/11 On the Sentence Embeddings from Pre-trained Language Models 0 TF BERT-flow
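
A common point of comparison for the encoders in the table above is to average pretrained word vectors into a single sentence vector. The sketch below illustrates that averaging baseline only; it is not the method of any listed paper. The names `mean_pool`, `cosine`, and `TOY_VECTORS` are hypothetical, the vector values are made up, and out-of-vocabulary tokens are simply dropped, which is one choice among several.

```python
import math


def mean_pool(tokens, word_vectors):
    """Average the vectors of known tokens; returns None if no token is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up 2-dimensional vectors standing in for a pretrained model.
TOY_VECTORS = {
    "cats": [1.0, 0.0],
    "dogs": [0.8, 0.2],
    "sleep": [0.0, 1.0],
    "run": [0.2, 0.8],
}

sentence_a = mean_pool("cats sleep".split(), TOY_VECTORS)
# "quickly" is not in the toy vocabulary, so it is ignored.
sentence_b = mean_pool("dogs run quickly".split(), TOY_VECTORS)
print(cosine(sentence_a, sentence_b))
```

Averaging discards word order, which is one of the limitations that trained sentence encoders aim to address.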

Evaluation

Misc

Vector Mapping

Articles