
The List Of Pretrained Word Embeddings

The information about these word embeddings is also summarized on GitHub.

Introduction

Word embedding is a technique that represents a word as a relatively low-dimensional real-valued vector (on the order of 200 dimensions or more). Its key property is that words with similar meanings are mapped to nearby vectors, and that adding and subtracting vectors can yield meaningful results (e.g. king - man + woman = queen).
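
As a toy illustration of "similar words map to nearby vectors" (the three-dimensional vectors below are made up for the example; real embeddings have hundreds of dimensions), cosine similarity is the usual measure of closeness:

import numpy as np

# Cosine similarity: values near 1 mean the vectors point in similar directions.
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up toy vectors, only to illustrate the idea.
king  = np.array([0.8, 0.3, 0.1])
queen = np.array([0.7, 0.4, 0.2])
apple = np.array([0.1, 0.9, 0.7])

print(cosine(king, queen))  # high: similar meanings, close vectors
print(cosine(king, apple))  # lower: unrelated meanings, distant vectors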

Word embeddings are an important technique used in various NLP applications such as part-of-speech tagging, information retrieval, and question answering. However, preparing word embeddings yourself is quite a lot of work: you have to download large-scale data, preprocess it, train for a long time, check the results, and tune many hyperparameters...

If you just want to use word embeddings, you should use pre-trained vectors. So here I list pre-trained word embeddings that you can use right away.


Commonly used: Word2Vec, GloVe, fastText

Word2Vec

Comment: The Word2Vec pre-trained vectors. If you don't know which to use, use these.
Year: 2013
URL: https://code.google.com/archive/p/word2vec/

Multilingual pre-trained vectors are also available.

GloVe

Comment: GloVe was developed at Stanford. The authors claim it outperforms Word2Vec. GloVe combines global matrix factorization with local context windows (a rough sketch of its objective appears below).
Year: 2014
URL: http://nlp.stanford.edu/projects/glove/
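
For intuition, GloVe fits the dot product of two word vectors (plus biases) to the log of their global co-occurrence count, weighted so that rare pairs count less. Here is a rough sketch of a single term of that objective (my paraphrase of the paper, not the official code; the constants are the paper's defaults):

import numpy as np

# Weighting function from the GloVe paper: down-weights rare co-occurrences
# and caps the weight of very frequent ones at 1.
def glove_weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

# One term of the loss for a word pair (i, j) with co-occurrence count x_ij.
def glove_term(w_i, w_j, b_i, b_j, x_ij):
    return glove_weight(x_ij) * (np.dot(w_i, w_j) + b_i + b_j - np.log(x_ij)) ** 2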

fastText

Comment: fastText was created by Mikolov, who also created Word2Vec. Training is very fast. To take subword information into account, each word is represented as a bag of character n-grams, and vectors are learned for those n-grams (sketched below).
Year: 2016
URL: https://github.com/icoxfog417/fastTextJapaneseTutorial

* This tutorial is in Japanese only.
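
As a minimal sketch of the character n-gram idea (my own illustration, not fastText's code): a word is wrapped in boundary symbols and split into n-grams, and its vector is built from the vectors of those n-grams.

# Split a word into character n-grams, as in the fastText paper's example.
def char_ngrams(word, n=3):
    padded = '<' + word + '>'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams('where'))  # ['<wh', 'whe', 'her', 'ere', 're>']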

Other Pre-trained Vectors

Dependency-Based Word Embeddings

Comment: Word embeddings by Levy et al. Because they are trained with dependency-based contexts instead of linear windows, they capture syntactic similarity well; they are a good choice if syntactic similarity is what you need (illustrated below).
Year: 2014
URL: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
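
As a rough sketch of what "dependency-based contexts" means (spaCy and its English model are my choice here for illustration, not something the authors prescribe): instead of the neighbouring words in a linear window, each word's contexts are its syntactic neighbours labelled with the dependency relation.

import spacy

nlp = spacy.load('en_core_web_sm')  # assumes the small English model is installed
doc = nlp('Australian scientist discovers star with telescope')

for token in doc:
    # Children of the token, labelled with their relation to it.
    contexts = [child.text + '/' + child.dep_ for child in token.children]
    # The token's head, labelled with the inverse relation.
    if token.dep_ != 'ROOT':
        contexts.append(token.head.text + '/' + token.dep_ + '-1')
    print(token.text, contexts)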

Meta-Embeddings

Comment: Meta-Embeddings, published at ACL 2016. By combining different publicly available embedding sets, better vectors (meta-embeddings) are generated (a minimal combination sketch follows below).
Year: 2016
URL: http://cistern.cis.lmu.de/meta-emb/
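
The simplest way to combine two public embedding sets is to L2-normalize each word's vectors and concatenate them; the paper's learned meta-embeddings go further, but a minimal concatenation sketch (my own illustration, assuming two already-loaded gensim models) looks like this:

import numpy as np

# Concatenate the normalized vectors of one word from two embedding models.
def concat_embedding(word, model_a, model_b):
    va = model_a[word] / np.linalg.norm(model_a[word])
    vb = model_b[word] / np.linalg.norm(model_b[word])
    return np.concatenate([va, vb])

# e.g. meta = concat_embedding('king', word2vec_model, glove_model)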

LexVec

Comment: LexVec, also published at ACL 2016. On some word similarity tasks, its results exceed Word2Vec.
Year: 2016
URL: https://github.com/alexandres/lexvec

Extra: How to Use the Pre-trained Vectors

In this section, I explain how to use the pre-trained vectors. Before reading on, download the Word2Vec pre-trained vectors linked above.

Downloaded them?
Loading them is super easy. Just install gensim and write the following code.

import gensim

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
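
Once the model is loaded, you can look up word vectors and similarities directly, for example:

# The raw 300-dimensional vector for a word, and a cosine similarity between two words.
print(model['king'].shape)               # (300,)
print(model.similarity('woman', 'man'))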

If you want to evaluate the model, write the following code. Note that you need to download questions-words.txt, the evaluation data for the analogy task, before running it.

import logging
import pprint

# for logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Load evaluation dataset of analogy task 
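# (note: in newer gensim versions, accuracy() has been replaced by evaluate_word_analogies())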
model.accuracy('questions-words.txt')
# execute analogy task like king - man + woman = queen
pprint.pprint(model.most_similar(positive=['woman', 'king'], negative=['man']))

After executing the code, the following evaluation result is output:

2017-01-20 09:29:11,767 : INFO : loading projection weights from ./GoogleNews-vectors-negative300.bin
2017-01-20 09:30:10,891 : INFO : loaded (3000000, 300) matrix from ./GoogleNews-vectors-negative300.bin
2017-01-20 09:30:10,994 : INFO : precomputing L2-norms of word weight vectors
2017-01-20 09:30:42,097 : INFO : capital-common-countries: 83.6% (423/506)
2017-01-20 09:30:49,899 : INFO : capital-world: 82.7% (1144/1383)
2017-01-20 09:30:50,795 : INFO : currency: 39.8% (51/128)
2017-01-20 09:31:03,579 : INFO : city-in-state: 74.6% (1739/2330)
2017-01-20 09:31:05,574 : INFO : family: 90.1% (308/342)
2017-01-20 09:31:09,928 : INFO : gram1-adjective-to-adverb: 32.3% (262/812)
2017-01-20 09:31:12,052 : INFO : gram2-opposite: 50.5% (192/380)
2017-01-20 09:31:19,719 : INFO : gram3-comparative: 91.9% (1224/1332)
2017-01-20 09:31:23,574 : INFO : gram4-superlative: 88.0% (618/702)
2017-01-20 09:31:28,210 : INFO : gram5-present-participle: 79.8% (694/870)
2017-01-20 09:31:35,082 : INFO : gram6-nationality-adjective: 97.1% (1193/1229)
2017-01-20 09:31:43,390 : INFO : gram7-past-tense: 66.5% (986/1482)
2017-01-20 09:31:49,136 : INFO : gram8-plural: 85.6% (849/992)
2017-01-20 09:31:53,394 : INFO : gram9-plural-verbs: 68.9% (484/702)
2017-01-20 09:31:53,396 : INFO : total: 77.1% (10167/13190)
[('queen', 0.7118192315101624),
('monarch', 0.6189674139022827),
('princess', 0.5902431011199951),
('crown_prince', 0.5499460697174072),
('prince', 0.5377321839332581),
('kings', 0.5236844420433044),
('Queen_Consort', 0.5235946178436279),
('queens', 0.5181134343147278),
('sultan', 0.5098593235015869),
('monarchy', 0.5087412595748901)]

Total accuracy is 77.1%!

Word vectors such as GloVe can be loaded in almost the same way.
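
For example, GloVe's text files lack the header that the word2vec format expects, but gensim ships a converter. A minimal sketch (the file name below is just an example; use whichever GloVe file you downloaded):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Add the word2vec-style header, then load as usual.
glove2word2vec('glove.6B.300d.txt', 'glove.6B.300d.w2v.txt')
model = KeyedVectors.load_word2vec_format('glove.6B.300d.w2v.txt', binary=False)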

Conclusion

In this article, I introduced several pre-trained word vectors.
Unless you have a specific reason to train your own, I recommend using these pre-trained vectors.

The information about these word embeddings is also summarized on GitHub.
