When training word embeddings, you may have used the text8 corpus.
According to its author, text8 was made by cleaning Wikipedia text and truncating it to 100MB.
text8 is frequently used in tutorials because it can be used without any preprocessing.
While text8 is useful for learning word embeddings in English, it is not useful for Japanese 🙂 .
Thus, I made a text8-like corpus in Japanese, called ja.text8.
You can readily use the corpus without any preprocessing.
In this post, I will introduce how to make a Japanese text8 corpus and train word embeddings with it. This post is organized as follows:
- What’s text8?
- How to make ja.text8
- Train word embeddings
What’s text8?
As mentioned above, text8 is made by preprocessing Wikipedia. The preprocessing steps are:
- remove tables.
- remove links to foreign language versions.
- remove citations, footnotes, markups.
- convert hypertext links to ordinary text.
- spell out digits. (e.g. “20” becomes “two zero”)
- convert uppercase to lowercase.
- convert characters not in the range a-z to a single space.
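As a rough illustration, the character-level rules above can be sketched in Python. This is a simplification of the original wikifil.pl preprocessing script, not the author’s exact code:

```python
# Sketch of text8-style normalization (simplified from wikifil.pl).
DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalize(text: str) -> str:
    out = []
    for ch in text.lower():  # convert uppercase to lowercase
        if ch in DIGITS:
            # spell out each digit: "20" -> "two zero"
            out.append(" " + DIGITS[ch] + " ")
        elif "a" <= ch <= "z":
            out.append(ch)
        else:
            # any character outside a-z becomes a single space
            out.append(" ")
    # collapse runs of spaces into single spaces
    return " ".join("".join(out).split())

print(normalize("Hello, World! 20"))  # -> "hello world two zero"
```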
The following image shows a part of text8:
How to make ja.text8
The procedure for making the Japanese version is as follows:
- Download Japanese Wikipedia
- Clean data by wikiextractor
- Extract documents and Morphological Analysis
Download Japanese Wikipedia
First, download a Wikipedia dump to build the corpus from. For this post, I used the 2017/10/01 version. Let’s download
jawiki-20171001-pages-articles.xml.bz2 as follows:
$ wget https://dumps.wikimedia.org/jawiki/20171001/jawiki-20171001-pages-articles.xml.bz2
Clean data by wikiextractor
It is necessary to clean the downloaded data since it contains unneeded markup. To strip the markup, I used wikiextractor. After downloading wikiextractor, run the following command:
$ python WikiExtractor.py -o extracted jawiki-20171001-pages-articles.xml.bz2
Cleaned text is stored as follows:
<doc id="5" url="https://ja.wikipedia.org/wiki?curid=5" title="アンパサンド">
アンパサンド
アンパサンド (, &) とは「…と…」を意味する記号である。
その意味は"and [the symbol which] by itself [is] and"である。
...
</doc>
A document is surrounded by <doc> tags.
In addition to the cleaning by wikiextractor, I removed the remaining <doc> tags so that only plain article text is left.
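Stripping the wrapper tags can be done with a small helper like the following (a hypothetical sketch, not the author’s script):

```python
import re

# Drop the <doc ...> / </doc> wrapper lines left by wikiextractor,
# keeping only the article body text.
DOC_TAG = re.compile(r"^</?doc.*>$")

def strip_doc_tags(lines):
    return [line for line in lines if not DOC_TAG.match(line.strip())]

sample = [
    '<doc id="5" url="https://ja.wikipedia.org/wiki?curid=5" title="アンパサンド">',
    "アンパサンド",
    "</doc>",
]
print(strip_doc_tags(sample))  # -> ['アンパサンド']
```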
Extract documents and Morphological Analysis
After cleaning the text, select the documents to include in the ja.text8 corpus. Although there are many document selection methods (e.g. ranking Wikipedia pages by internal PageRank), I simply selected documents until the corpus size reached 100MB. The selected documents are then split into words by morphological analysis and saved to a file.
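The selection step can be sketched as follows. Here `str.split` is a placeholder tokenizer for illustration; the real corpus is tokenized by a Japanese morphological analyzer (e.g. MeCab), and `build_corpus` is a hypothetical helper, not the author’s code:

```python
# Accumulate tokenized documents until the corpus reaches the size limit
# (100MB for text8/ja.text8). `tokenize` stands in for morphological analysis.
def build_corpus(documents, tokenize, limit_bytes=100 * 1024 * 1024):
    pieces, size = [], 0
    for doc in documents:
        tokens = " ".join(tokenize(doc))
        n = len(tokens.encode("utf-8"))
        if size + n > limit_bytes:
            break  # adding this document would exceed the target size
        pieces.append(tokens)
        size += n
    return " ".join(pieces)

print(build_corpus(["a b c", "d e f", "g h"], str.split, limit_bytes=11))  # -> "a b c d e f"
```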
That’s all. Let’s find the number of words and unique words.
>>> f = open("ja.text8")
>>> words = f.read().split()
>>> len(words)
16900026
>>> len(set(words))
290811
I found that the total number of words is 16,900,026 and the number of unique words is 290,811.
Train word embeddings
Let’s learn word embeddings using ja.text8 and gensim. After downloading the corpus, run the following code. Training takes about 2 minutes to finish.
import logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus('ja.text8')
model = word2vec.Word2Vec(sentences, size=200)
After training, let’s try to use it.
>>> model.most_similar(['日本'])
[('中国', 0.598496675491333),
 ('韓国', 0.5914819240570068),
 ('アメリカ', 0.5286925435066223),
 ('英国', 0.5090063810348511),
 ('台湾', 0.4761126637458801),
 ('米国', 0.45954638719558716),
 ('アメリカ合衆国', 0.45181626081466675),
 ('イギリス', 0.44740626215934753),
 ('ソ連', 0.43657147884368896),
 ('海外', 0.4325913190841675)]
In this post, I made a Japanese text8 corpus and trained word embeddings with it. Thanks to ja.text8, we can train Japanese word embeddings easily.