ja.text8 – Japanese text8 Corpus for Word Embeddings –

If you have trained word embeddings, you may have used the text8 corpus.
According to its author, text8 was made by cleaning Wikipedia text and truncating it to 100MB.
It is frequently used in tutorials because it requires no preprocessing.

While text8 is useful for learning English word embeddings, it is not useful for Japanese 🙂.
So I made a text8-like corpus in Japanese, called ja.text8.
You can use the corpus right away, without any preprocessing.

In this post, I will show how to make the Japanese text8 corpus and how to train word embeddings on it. The post is organized as follows:

  • What’s text8?
  • How to make ja.text8
  • Train word embeddings

What’s text8?

As mentioned above, text8 is made by preprocessing Wikipedia. The preprocessing steps are:

  • remove tables.
  • remove links to foreign language versions.
  • remove citations, footnotes, markups.
  • convert hypertext links to ordinary text.
  • spell out digits. (e.g. “20” becomes “two zero”)
  • convert uppercase to lowercase.
  • convert characters not in the range a-z to a single space.
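The last three steps in the list can be sketched in a few lines of Python. This is only an illustration, not the script used to build text8: the function name text8_like_normalize and the DIGIT_NAMES table are made up for this example, and the markup-removal steps are not covered.

```python
import re

# Names used to spell out each digit individually (hypothetical helper table).
DIGIT_NAMES = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def text8_like_normalize(text):
    """Lowercase, spell out digits, and turn everything outside a-z into single spaces."""
    text = text.lower()
    # Pad each digit name with spaces so adjacent digits stay separate words.
    text = re.sub(r"[0-9]", lambda m: " " + DIGIT_NAMES[m.group()] + " ", text)
    # Every run of characters outside a-z becomes one space.
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()


print(text8_like_normalize("Hello, World 20!"))  # hello world two zero
```

Because everything outside a-z is replaced by a space, the same rules would erase Japanese text entirely, which is why a separate corpus is needed.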

The following image shows a part of text8:

How to make ja.text8

The Japanese version is made as follows:

  1. Download Japanese Wikipedia
  2. Clean data by wikiextractor
  3. Extract documents and Morphological Analysis

Download Japanese Wikipedia

First, download the Wikipedia dump data to build the corpus from. This time, I used the 2017/10/01 version. Let’s download jawiki-20171001-pages-articles.xml.bz2 as follows:

$ wget https://dumps.wikimedia.org/jawiki/20171001/jawiki-20171001-pages-articles.xml.bz2

Clean data by wikiextractor


The downloaded data has to be cleaned because it contains markup we do not need. To remove the markup, I used wikiextractor. After downloading wikiextractor, run the following command to clean the text:

$ python WikiExtractor.py -o extracted jawiki-20171001-pages-articles.xml.bz2 


The cleaned text is stored as follows:

<doc id="5" url="https://ja.wikipedia.org/wiki?curid=5" title="アンパサンド">
アンパサンド

アンパサンド (, &) とは「…と…」を意味する記号である。

その意味は"and [the symbol which] by itself [is] and"である。
...

</doc>


Each document is surrounded by <doc> tags, and the first line between the tags is the document title.
In addition to the cleaning done by wikiextractor, I removed the tags, the title line, and the newlines, converted uppercase to lowercase, and saved each document on one line.
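A minimal sketch of this extra cleaning step might look like the following. The names DOC_PATTERN and iter_documents are hypothetical, and joining the remaining lines with a space is an assumption; the post only says that newlines are removed.

```python
import re

# Matches one <doc ...> ... </doc> block in the format shown in the sample output.
DOC_PATTERN = re.compile(r"<doc\b[^>]*>\n(.*?)</doc>", re.DOTALL)


def iter_documents(extracted_text):
    """Yield each article as one lowercase line, without tags or the title line."""
    for match in DOC_PATTERN.finditer(extracted_text):
        lines = match.group(1).split("\n")
        body_lines = lines[1:]  # the first line inside the tags is the title
        # Drop empty lines and join the rest so the document fits on one line.
        body = " ".join(line.strip() for line in body_lines if line.strip())
        if body:
            yield body.lower()


sample = (
    '<doc id="1" url="u" title="T">\n'
    "T\n\nHello World\nSecond Line\n"
    "</doc>\n"
)
print(list(iter_documents(sample)))  # ['hello world second line']
```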

Extract documents and Morphological Analysis

After cleaning the text, select the documents to include in the ja.text8 corpus. Although there are many ways to select documents (e.g. Wikipedia internal PageRank), I simply kept adding documents until the corpus size reached 100MB. The selected documents are then split into words by morphological analysis and saved to a file.
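The selection step could be sketched roughly as below. The function build_corpus and its parameters are made up for illustration. The sketch assumes that 100MB means 100,000,000 bytes of UTF-8 and that selection stops at the first document that no longer fits. The post does not say which morphological analyzer was used, so the tokenizer is passed in as a plain function.

```python
def build_corpus(documents, tokenize, max_bytes=100 * 1000 * 1000):
    """Join tokenized documents with spaces until the UTF-8 size would exceed max_bytes."""
    pieces = []
    total = 0
    for doc in documents:
        tokens = tokenize(doc)
        if not tokens:
            continue
        piece = " ".join(tokens)
        # Count the separating space that precedes every piece but the first.
        size = len(piece.encode("utf-8")) + (1 if pieces else 0)
        if total + size > max_bytes:
            break
        pieces.append(piece)
        total += size
    return " ".join(pieces)


# str.split stands in for a real morphological analyzer in this toy run.
print(build_corpus(["a b c", "d e", "f g h i"], str.split, max_bytes=9))  # a b c d e
```

With MeCab and its Python binding installed, something like `lambda doc: MeCab.Tagger("-Owakati").parse(doc).split()` could serve as `tokenize`, though that choice is an assumption rather than what the post describes.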

That’s all. Let’s count the total number of words and the number of unique words.

>>> with open("ja.text8", encoding="utf-8") as f:
...     words = f.read().split()
...
>>> len(words)
16900026
>>> len(set(words))
290811

The corpus contains 16,900,026 words in total, of which 290,811 are unique.
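Not all unique words end up in a trained model: gensim's Word2Vec ignores words that occur fewer than min_count times (5 by default). The sketch below counts how many unique words would survive such a threshold. The function vocabulary_stats is a hypothetical helper, shown here on a toy word list rather than on ja.text8.

```python
from collections import Counter


def vocabulary_stats(words, min_count=5):
    """Return (total tokens, unique tokens, unique tokens seen at least min_count times)."""
    counts = Counter(words)
    kept = sum(1 for c in counts.values() if c >= min_count)
    return sum(counts.values()), len(counts), kept


toy_words = "a a a b b c".split()
print(vocabulary_stats(toy_words, min_count=2))  # (6, 3, 2)
```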

Train word embeddings

Let’s learn word embeddings from ja.text8 with gensim. After downloading the corpus, run the following code. Training takes about 2 minutes.

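import logging
from gensim.models import word2vec

# Log training progress to the console.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Text8Corpus reads the single-line, space-separated corpus file.
sentences = word2vec.Text8Corpus('ja.text8')
# Train 200-dimensional vectors (gensim versions before 4.0 call this parameter `size`).
model = word2vec.Word2Vec(sentences, vector_size=200)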

After training, let’s try to use it.

>>> model.wv.most_similar(['日本'])
[('中国', 0.598496675491333),
 ('韓国', 0.5914819240570068),
 ('アメリカ', 0.5286925435066223),
 ('英国', 0.5090063810348511),
 ('台湾', 0.4761126637458801),
 ('米国', 0.45954638719558716),
 ('アメリカ合衆国', 0.45181626081466675),
 ('イギリス', 0.44740626215934753),
 ('ソ連', 0.43657147884368896),
 ('海外', 0.4325913190841675)]


Great!
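The scores in the output are cosine similarities between word vectors. To show what that ranking means, here is a small pure-Python sketch. The toy vectors are invented for illustration and have nothing to do with the trained model, and the function names are hypothetical stand-ins, not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2-dimensional vectors, made up for this example only.
toy_vectors = {"日本": [1.0, 0.0], "中国": [0.8, 0.6], "海外": [0.0, 1.0]}
print(rank_by_similarity("日本", toy_vectors))
```

In this toy example 中国 ranks above 海外 because its vector points in a direction closer to that of 日本.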

Final words

In this post, I made a Japanese text8 corpus and trained word embeddings on it. Thanks to ja.text8, we can train Japanese word embeddings easily.
