Word embedding is a technique that represents a word as a low-dimensional real-valued vector (typically a few hundred dimensions). Its key property is that words with similar meanings map to nearby vectors, and adding or subtracting vectors can produce meaningful results (e.g. king - man + woman = queen).
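To make the arithmetic concrete, here is a toy sketch with hypothetical 3-dimensional vectors (real embeddings have hundreds of dimensions, and these numbers are made up purely for illustration):

```python
from math import sqrt

# Hypothetical toy vectors, chosen so the analogy works out exactly.
vectors = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Compute king - man + woman component-wise.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# The closest remaining word (excluding the query words) is the answer.
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```

Real pre-trained vectors answer such analogy queries the same way, just in a much larger vocabulary and dimension.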
Word embedding is an important technique used in various NLP applications such as part-of-speech tagging, information retrieval, and question answering. However, preparing word embeddings yourself is quite troublesome: you have to download large-scale data, preprocess it, train for a long time, check the results, tune many hyperparameters, and so on.
If you just want to use word embeddings, you should use pre-trained vectors. So here I list pre-trained word embeddings that you can use right away.
Commonly used: Word2Vec, GloVe, fastText
|Comment||Word2Vec pre-trained vectors. If you are not sure which to use, use these.|
You can obtain multilingual pre-trained vectors as follows:
|Comment||GloVe, developed at Stanford. The authors claim it is better than Word2Vec. GloVe combines global matrix factorization with local context windows.|
|Comment||fastText, created by Mikolov, the author of Word2Vec. Training is very fast! To account for subword structure, each word is represented by its character n-grams, and vector representations of those n-grams are learned.|
* This one is available only in Japanese…
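The character n-gram idea can be sketched in a few lines. Note this is a simplification: here only n = 3 is used, while fastText actually extracts n-grams over a range of lengths (typically 3 to 6):

```python
def char_ngrams(word, n=3):
    """Extract fastText-style character n-grams of a single length n."""
    token = "<" + word + ">"          # boundary markers, as in fastText
    grams = [token[i:i + n] for i in range(len(token) - n + 1)]
    grams.append(token)               # the full word itself is also kept
    return grams

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

A word's vector is then the sum of the vectors of its n-grams, which lets the model build representations even for words it never saw during training.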
Other Pre-trained Vectors
Dependency-Based Word Embeddings
|Comment||Word embeddings by Levy et al. By learning from dependency-based contexts, they capture syntactic similarity well. They might be a good choice if syntactic similarity matters for your task.|
|Comment||Meta-Embeddings, published at ACL 2016. By combining different public embedding sets, better vectors (meta-embeddings) are generated.|
|Comment||LexVec, also published at ACL 2016. On word similarity tasks, some of its results exceed Word2Vec.|
Extra: How to use the pre-trained vector
In this section, I explain how to use pre-trained vectors. Before reading on, download the Word2Vec pre-trained vectors.
Reading them is super easy. Just install gensim and write the following code.
import gensim

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
If you want to evaluate the model, write the following code. Note that you need to download questions-words.txt, the evaluation data, before running it.
import logging
import pprint

# for logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the evaluation dataset of the analogy task
model.accuracy('questions-words.txt')

# Execute an analogy task like king - man + woman = queen
pprint.pprint(model.most_similar(positive=['woman', 'king'], negative=['man']))
After executing the code, the following evaluation result is output:
2017-01-20 09:29:11,767 : INFO : loading projection weights from ./GoogleNews-vectors-negative300.bin
2017-01-20 09:30:10,891 : INFO : loaded (3000000, 300) matrix from ./GoogleNews-vectors-negative300.bin
2017-01-20 09:30:10,994 : INFO : precomputing L2-norms of word weight vectors
2017-01-20 09:30:42,097 : INFO : capital-common-countries: 83.6% (423/506)
2017-01-20 09:30:49,899 : INFO : capital-world: 82.7% (1144/1383)
2017-01-20 09:30:50,795 : INFO : currency: 39.8% (51/128)
2017-01-20 09:31:03,579 : INFO : city-in-state: 74.6% (1739/2330)
2017-01-20 09:31:05,574 : INFO : family: 90.1% (308/342)
2017-01-20 09:31:09,928 : INFO : gram1-adjective-to-adverb: 32.3% (262/812)
2017-01-20 09:31:12,052 : INFO : gram2-opposite: 50.5% (192/380)
2017-01-20 09:31:19,719 : INFO : gram3-comparative: 91.9% (1224/1332)
2017-01-20 09:31:23,574 : INFO : gram4-superlative: 88.0% (618/702)
2017-01-20 09:31:28,210 : INFO : gram5-present-participle: 79.8% (694/870)
2017-01-20 09:31:35,082 : INFO : gram6-nationality-adjective: 97.1% (1193/1229)
2017-01-20 09:31:43,390 : INFO : gram7-past-tense: 66.5% (986/1482)
2017-01-20 09:31:49,136 : INFO : gram8-plural: 85.6% (849/992)
2017-01-20 09:31:53,394 : INFO : gram9-plural-verbs: 68.9% (484/702)
2017-01-20 09:31:53,396 : INFO : total: 77.1% (10167/13190)
[('queen', 0.7118192315101624),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235946178436279),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087412595748901)]
Total accuracy is 77.1%!
Word vectors such as GloVe can be read in almost the same way.
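The one difference is that GloVe's text format lacks the header line ("<vocab_size> <dimensions>") that the word2vec text format starts with. A minimal sketch that prepends this header so gensim can read the file (file names here are hypothetical):

```python
def glove_to_word2vec(glove_path, out_path):
    """Prepend the word2vec-style header line to a GloVe text file."""
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()
    # Each line is "<word> <v1> <v2> ...", so dimensions = tokens - 1.
    dims = len(lines[0].split()) - 1
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{len(lines)} {dims}\n")
        f.writelines(lines)

# The converted file then loads like any word2vec text file, e.g.:
# model = gensim.models.KeyedVectors.load_word2vec_format('glove.converted.txt', binary=False)
```

gensim also ships a glove2word2vec conversion script that does the same thing.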
In this article, I introduced several pre-trained vectors.
Unless you have a special reason to train your own, I recommend using these pre-trained vectors.