A List of Pretrained Word Embeddings

The full list of pre-trained word embeddings is maintained on GitHub.

Introduction

Word embedding is a technique that represents a word as a low-dimensional real-valued vector (roughly 200 dimensions or more). Its key property is that words with similar meanings map to nearby vectors, and that adding and subtracting vectors yields meaningful results (e.g. king – man + woman = queen).
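That vector arithmetic can be sketched with a toy example. The 4-dimensional vectors below are invented purely for illustration; real embeddings have hundreds of dimensions learned from large corpora.

```python
import math

# Toy vectors, made up for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.8, 0.0],
    "man":   [0.1, 0.9, 0.1, 0.1],
    "woman": [0.1, 0.2, 0.8, 0.1],
    "apple": [0.0, 0.1, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman, element-wise
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest word to the result, excluding the query words themselves
best = max(
    (w for w in vectors if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(target, vectors[w]),
)
print(best)  # -> queen
```

With real pre-trained vectors, gensim's `most_similar` performs exactly this nearest-neighbor search for you, as shown later in this article.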

Word embedding is an important technique used in various NLP applications such as part-of-speech tagging, information retrieval, and question answering. However, preparing word embeddings yourself is quite a lot of work: you download large-scale data, preprocess it, train for a long time, check the results, and tune many hyperparameters.

If you just want to use word embeddings, pre-trained vectors are the way to go. So here I list pre-trained word embeddings you can use right away.


Commonly used: Word2Vec, GloVe, fastText

Word2Vec

Comment The original Word2Vec pre-trained vectors. If you don't know which to use, use these.
Year 2013
URL https://code.google.com/archive/p/word2vec/

Multilingual pre-trained vectors are also available.

GloVe

Comment GloVe was developed at Stanford. Its authors claim it outperforms Word2Vec. GloVe combines global matrix factorization with local context windows.
Year 2014
URL http://nlp.stanford.edu/projects/glove/

fastText

Comment fastText was created by Mikolov, the genius behind Word2Vec. Training is very fast! To take subword information into account, each word is represented as a bag of character n-grams, and vector representations are learned for those n-grams.
Year 2016
URL https://github.com/icoxfog417/fastTextJapaneseTutorial

* Note: this link is a tutorial for Japanese text only…
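The character n-gram idea can be sketched as follows. This is a simplified illustration of the extraction step only; real fastText additionally hashes the n-grams into a fixed number of buckets and sums their vectors with the word's own vector.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract character n-grams in the fastText style: the word is
    wrapped in boundary markers < and >, then all substrings of
    length n_min..n_max are collected."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# e.g. "where" with 3- and 4-grams
print(char_ngrams("where", 3, 4))
# -> ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
```

Because a word's vector is built from its n-grams, fastText can produce vectors even for words never seen during training.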

Other Pre-trained Vectors

Dependency-Based Word Embeddings

Comment Word embeddings by Levy et al. Because they are trained with dependency-based contexts, they capture syntactic similarity well. They may be a good choice if syntactic similarity matters for your task.
Year 2014
URL https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

Meta-Embeddings

Comment Meta-Embeddings, published at ACL 2016. By combining different publicly available embedding sets, better vectors (meta-embeddings) are generated.
Year 2016
URL http://cistern.cis.lmu.de/meta-emb/

LexVec

Comment LexVec, also published at ACL 2016. On word similarity tasks, some of its results exceed Word2Vec's.
Year 2016
URL https://github.com/alexandres/lexvec

Extra: How to use the pre-trained vector

In this section, I explain how to use pre-trained vectors. Before reading on, download the Word2Vec pre-trained vectors.

Downloaded them?
Reading them is super easy: just install gensim and write the following code.

import gensim

# Load Google's pre-trained Word2Vec model.
# In recent gensim versions the loader lives on KeyedVectors.
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

If you want to evaluate the model, run the following code. Note that you need to download questions-words.txt, the evaluation dataset, before running it.

import logging
import pprint

# for logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Evaluate on the analogy task dataset
# (in gensim >= 4.0, use model.evaluate_word_analogies('questions-words.txt'))
model.accuracy('questions-words.txt')
# execute analogy task like king - man + woman = queen
pprint.pprint(model.most_similar(positive=['woman', 'king'], negative=['man']))

After executing the code, the following evaluation result is output:

2017-01-20 09:29:11,767 : INFO : loading projection weights from ./GoogleNews-vectors-negative300.bin
2017-01-20 09:30:10,891 : INFO : loaded (3000000, 300) matrix from ./GoogleNews-vectors-negative300.bin
2017-01-20 09:30:10,994 : INFO : precomputing L2-norms of word weight vectors
2017-01-20 09:30:42,097 : INFO : capital-common-countries: 83.6% (423/506)
2017-01-20 09:30:49,899 : INFO : capital-world: 82.7% (1144/1383)
2017-01-20 09:30:50,795 : INFO : currency: 39.8% (51/128)
2017-01-20 09:31:03,579 : INFO : city-in-state: 74.6% (1739/2330)
2017-01-20 09:31:05,574 : INFO : family: 90.1% (308/342)
2017-01-20 09:31:09,928 : INFO : gram1-adjective-to-adverb: 32.3% (262/812)
2017-01-20 09:31:12,052 : INFO : gram2-opposite: 50.5% (192/380)
2017-01-20 09:31:19,719 : INFO : gram3-comparative: 91.9% (1224/1332)
2017-01-20 09:31:23,574 : INFO : gram4-superlative: 88.0% (618/702)
2017-01-20 09:31:28,210 : INFO : gram5-present-participle: 79.8% (694/870)
2017-01-20 09:31:35,082 : INFO : gram6-nationality-adjective: 97.1% (1193/1229)
2017-01-20 09:31:43,390 : INFO : gram7-past-tense: 66.5% (986/1482)
2017-01-20 09:31:49,136 : INFO : gram8-plural: 85.6% (849/992)
2017-01-20 09:31:53,394 : INFO : gram9-plural-verbs: 68.9% (484/702)
2017-01-20 09:31:53,396 : INFO : total: 77.1% (10167/13190)
[('queen', 0.7118192315101624),
('monarch', 0.6189674139022827),
('princess', 0.5902431011199951),
('crown_prince', 0.5499460697174072),
('prince', 0.5377321839332581),
('kings', 0.5236844420433044),
('Queen_Consort', 0.5235946178436279),
('queens', 0.5181134343147278),
('sultan', 0.5098593235015869),
('monarchy', 0.5087412595748901)]

Total accuracy is 77.1%!

Word vectors such as GloVe can be read in almost the same way.
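One caveat: GloVe's text format has no header line, whereas word2vec's text format starts with a `vocab_size dim` line. To load GloVe files with gensim, add the header first (e.g. via `gensim.scripts.glove2word2vec`) or, in newer gensim versions, pass `no_header=True` to `load_word2vec_format`. A minimal sketch of parsing the GloVe format directly, using a tiny made-up snippet:

```python
import io

# A tiny GloVe-format snippet: one word per line, vector values
# space-separated, no header line. (Invented values for illustration.)
glove_text = """\
the 0.1 0.2 0.3 0.4
cat 0.5 0.6 0.7 0.8
sat 0.9 1.0 1.1 1.2
"""

def read_glove(fileobj):
    """Parse GloVe text format into a {word: [float, ...]} dict."""
    vectors = {}
    for line in fileobj:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

vecs = read_glove(io.StringIO(glove_text))
print(len(vecs), len(vecs["cat"]))  # -> 3 4 (3 words, 4 dimensions)
```

For real GloVe files, replace the `StringIO` with `open('glove.6B.300d.txt', encoding='utf-8')`.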

Conclusion

In this article, I introduced several pre-trained word vectors.
Unless you have a special reason to train your own, I recommend using these pre-trained vectors.

