Exploring the Japanese text8 Corpus: A Comprehensive Guide

If you have ever trained word embeddings, you may have used the text8 corpus.
According to its author, text8 was made by cleaning English Wikipedia text and truncating it to 100MB.
text8 is frequently used in tutorials because it can be used without any preprocessing.

While text8 is useful for learning word embeddings in English, it is not useful for Japanese 🙂 .
Thus, I made a text8-like corpus for Japanese, called ja.text8.
You can readily use the corpus without any preprocessing.

 

In this post, I will introduce how to make a Japanese text8 corpus and how to train word embeddings on it. This post is organized as follows:

  • What’s text8?
  • How to make ja.text8
  • Train word embeddings

What’s text8?

As mentioned above, text8 is made by preprocessing Wikipedia text. The preprocessing steps are:

  • remove tables.
  • remove links to foreign language versions.
  • remove citations, footnotes, markups.
  • convert hypertext links to ordinary text.
  • spell out numbers. (e.g. “20” becomes “two zero”)
  • convert uppercase to lowercase.
  • convert characters not in the range a-z to a single space.

[Image: an excerpt of the text8 corpus]
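To make the last three steps concrete, here is a tiny Python sketch of the character-level normalization (my own illustration; the function name and exact regexes are assumptions, not the original cleanup script):

import re

# Map each digit to its spelled-out form, e.g. "20" -> "two zero"
DIGITS = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
          '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}

def text8_normalize(text):
    text = text.lower()
    # Spell out each digit individually
    text = re.sub(r'\d', lambda m: ' ' + DIGITS[m.group()] + ' ', text)
    # Convert any run of characters outside a-z to a single space
    return re.sub(r'[^a-z]+', ' ', text).strip()

print(text8_normalize('Hello, World 20!'))  # hello world two zero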

How to make ja.text8

The procedure for making the Japanese version is as follows:

  1. Download Japanese Wikipedia
  2. Clean data by wikiextractor
  3. Extract documents and Morphological Analysis

Download Japanese Wikipedia

First, download the Wikipedia dump data to build the corpus. Here, I used the 2017/10/01 version. Let’s download jawiki-20171001-pages-articles.xml.bz2 as follows:

$ wget https://dumps.wikimedia.org/jawiki/20171001/jawiki-20171001-pages-articles.xml.bz2

Clean data by wikiextractor

It is necessary to clean the downloaded data since it contains unneeded markup. To clean up the markup, I used wikiextractor. After downloading wikiextractor, run the following command:

$ python WikiExtractor.py -o extracted jawiki-20171001-pages-articles.xml.bz2 

The cleaned text is stored as follows:

<doc id="5" url="https://ja.wikipedia.org/wiki?curid=5" title="アンパサンド">
アンパサンド

アンパサンド (, &) とは「…と…」を意味する記号である。

その意味は"and [the symbol which] by itself [is] and"である。
...

</doc>

Each document is surrounded by <doc> tags, and the first line between the tags is the document title.
In addition to the cleaning by wikiextractor, I removed the tags, title lines, and newlines, converted uppercase to lowercase, and saved each document on a single line.
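As a rough sketch, this post-processing could look like the following (the extracted directory and the helper function name are my own assumptions):

import os
import re

def iter_documents(extracted_dir='extracted'):
    """Yield each <doc> block as a single lowercased line."""
    for root, _, files in os.walk(extracted_dir):
        for name in files:
            with open(os.path.join(root, name), encoding='utf-8') as f:
                text = f.read()
            for block in re.findall(r'<doc[^>]*>(.*?)</doc>', text, re.DOTALL):
                lines = [line for line in block.strip().split('\n') if line.strip()]
                # The first line is the title; drop it and join the rest into one line
                yield ' '.join(lines[1:]).lower()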

Extract documents and Morphological Analysis

From the cleaned text, I then selected the documents that make up ja.text8. While there are various document selection methods (such as Wikipedia’s internal PageRank), I opted for a simple size-based approach: selecting documents until the total corpus size reached 100MB. The selected documents were then split into words by morphological analysis and saved to a file.
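The post does not name a tokenizer, but as one possible sketch, the selection and tokenization step could be written with MeCab (via the mecab-python3 package); TARGET_SIZE and the reuse of iter_documents from the previous sketch are my assumptions:

import MeCab

TARGET_SIZE = 100 * 1024 * 1024  # stop once the corpus reaches about 100MB
tagger = MeCab.Tagger('-Owakati')  # wakati mode: space-separated words

size = 0
with open('ja.text8', 'w', encoding='utf-8') as out:
    for doc in iter_documents('extracted'):
        line = tagger.parse(doc).strip() + '\n'
        out.write(line)
        size += len(line.encode('utf-8'))
        if size >= TARGET_SIZE:
            break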

That’s all. Let’s count the total number of words and the number of unique words:

>>> f = open("ja.text8")
>>> words = f.read().split()
>>> len(words)
16900026
>>> len(set(words))
290811

 

I found that the total number of words is 16,900,026 and the number of unique words is 290,811.

Train word embeddings

Let’s train word embeddings using ja.text8 and gensim. After downloading the corpus, run the following code. Training takes about two minutes.

import logging
from gensim.models import word2vec

# Show training progress logs
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Stream the whitespace-separated corpus
sentences = word2vec.Text8Corpus('ja.text8')
# Train 200-dimensional word vectors
model = word2vec.Word2Vec(sentences, size=200)

 

After training, let’s look at the words most similar to 日本 (Japan):

>>> model.most_similar(['日本'])
[('中国', 0.598496675491333),
 ('韓国', 0.5914819240570068),
 ('アメリカ', 0.5286925435066223),
 ('英国', 0.5090063810348511),
 ('台湾', 0.4761126637458801),
 ('米国', 0.45954638719558716),
 ('アメリカ合衆国', 0.45181626081466675),
 ('イギリス', 0.44740626215934753),
 ('ソ連', 0.43657147884368896),
 ('海外', 0.4325913190841675)]

Great! The nearest neighbors are all countries and regions, as we would expect.
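To reuse the trained vectors later without retraining, you can persist the model with gensim’s save and load (the file name ja.text8.model is my choice):

# Persist the trained model to disk
model.save('ja.text8.model')

# Later, reload it and query as before
model = word2vec.Word2Vec.load('ja.text8.model')
model.most_similar(['日本'])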

Final words

In this post, I made a Japanese text8 corpus and trained word embeddings on it. Thanks to ja.text8, we can train Japanese word embeddings easily.
