Trying Out GloVe Word Embeddings

I did little more than follow the manual.
GloVe Word Embeddings
<Environment>

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=Japanese_Japan.932  LC_CTYPE=Japanese_Japan.932   
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C                  
[5] LC_TIME=Japanese_Japan.932    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] RevoUtils_10.0.6     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.
> packageVersion("text2vec")
[1] ‘0.5.0’

【0】Load the library

library(text2vec)

【1】Download and read the data

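# Download the text8 corpus (a zip archive containing a single file named "text8")
download.file("http://mattmahoney.net/dc/text8.zip", "text8.zip")
# Extract only the "text8" file into the working directory
unzip("text8.zip", files = "text8")
# Read the corpus into a character vector
text8 <- readLines("text8", warn = FALSE)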

【2】Extract the words

# Token iterator over the corpus
it <- itoken(text8)
# Count every term, then drop terms that occur fewer than 5 times
voc <- create_vocabulary(it)
voc <- prune_vocabulary(voc, term_count_min = 5)
> voc
Number of docs: 1 
0 stopwords:  ... 
ngram_min = 1; ngram_max = 1 
Vocabulary: 
              term term_count doc_count
    1:   kentauros          5         1
    2:   tornatore          5         1
    3: phantastica          5         1
    4:   steinhoff          5         1
    5:      minthe          5         1
   ---                                 
71286:          in     372201         1
71287:         one     411764         1
71288:         and     416629         1
71289:          of     593677         1
71290:         the    1061396         1
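
To make the counting and pruning step above more concrete, here is a minimal base R sketch on a toy sentence. It does not use text2vec; the corpus, `min_count`, and the `vocab` data frame are made up for illustration, and it only approximates what create_vocabulary() and prune_vocabulary() report.

```r
# Toy corpus: one "document" of space-separated words, like text8 (hypothetical data)
doc <- "the cat sat on the mat the cat ran"

# Split on single spaces
tokens <- strsplit(doc, " ", fixed = TRUE)[[1]]

# Count how often each term occurs, as a named integer vector
tab <- table(tokens)
counts <- setNames(as.integer(tab), names(tab))

# Keep only terms seen at least `min_count` times, sorted by ascending count
min_count <- 2L
kept <- sort(counts[counts >= min_count])

vocab <- data.frame(term = names(kept),
                    term_count = unname(kept),
                    stringsAsFactors = FALSE)
print(vocab)
```

With term_count_min = 5 on text8, the same idea is what leaves the 71290 terms shown above.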

【3】Create the TCM (term co-occurrence matrix)

# Count co-occurrences of vocabulary terms within a window of 5 tokens
tcm <- create_tcm(it, vocab_vectorizer(voc), skip_grams_window = 5)
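
As a rough picture of what a windowed co-occurrence matrix holds, the following base R sketch builds one by hand for a toy token sequence. The tokens, the window of 2, and the 1/distance weighting are assumptions made for this illustration (my understanding is that create_tcm() also weights by inverse distance by default, but check the documentation for your version).

```r
# Toy token sequence and a small window (hypothetical values)
tokens <- c("the", "cat", "sat", "on", "the", "mat")
terms <- unique(tokens)
window <- 2

# Dense term-by-term matrix, initialised to zero
tcm_toy <- matrix(0, length(terms), length(terms),
                  dimnames = list(terms, terms))

for (i in seq_along(tokens)) {
  for (d in seq_len(window)) {
    j <- i + d
    if (j > length(tokens)) break
    a <- tokens[i]
    b <- tokens[j]
    # Closer pairs contribute more: weight 1 at distance 1, 1/2 at distance 2
    tcm_toy[a, b] <- tcm_toy[a, b] + 1 / d
    # Mirror the count so the matrix is symmetric
    if (a != b) tcm_toy[b, a] <- tcm_toy[b, a] + 1 / d
  }
}

print(tcm_toy)
```

A real TCM over 71290 terms is stored as a sparse matrix; the dense matrix here is only practical at toy size.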

【4】Create the word vectors

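# 50-dimensional GloVe model; x_max is the co-occurrence count at which the weighting saturates
glove <- GlobalVectors$new(word_vectors_size = 50, vocabulary = voc, x_max = 10)
# fit_transform() returns the main word vectors (one row per term)
main <- glove$fit_transform(tcm, n_iter = 20)
# components holds the context vectors (one column per term), hence the t() below
context <- glove$components
word_vectors <- main + t(context)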
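
The role of x_max is easier to see from the weighting function in the GloVe paper, f(x) = (x / x_max)^alpha for x < x_max and 1 otherwise. The sketch below writes it out in base R; `glove_weight` is a hypothetical helper name, and alpha = 0.75 is the value from the paper, which I am assuming rather than reading from text2vec.

```r
# GloVe weighting function (helper name is hypothetical; alpha = 0.75 as in the GloVe paper)
glove_weight <- function(x, x_max = 10, alpha = 0.75) {
  # Rare co-occurrences are down-weighted; counts at or above x_max all get weight 1
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}

# Weights for a few co-occurrence counts with x_max = 10
w <- glove_weight(c(1, 5, 10, 100))
print(w)
```

With x_max = 10, every pair that co-occurs 10 times or more contributes to the loss with the same weight.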

【5】Check whether it worked (paris - france + germany)

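# Look up each word's vector; drop = FALSE keeps them as 1-row matrices
paris <- word_vectors["paris",, drop = FALSE]
france <- word_vectors["france",, drop = FALSE]
germany <- word_vectors["germany",, drop = FALSE]

# The analogy vector should land near the capital of Germany
answer <- paris - france + germany

# Cosine similarity of every word vector to the analogy vector; show the top 5
cos_sim <- sim2(x = word_vectors, y = answer, method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 5)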
   berlin     paris    munich   germany    vienna 
0.7652117 0.7228930 0.6976820 0.6499416 0.6340590
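
The ranking above includes paris and germany themselves, which is expected because the query vector is built from them. A common convention is to drop the query words before ranking. The base R sketch below shows the idea with hand-picked 3-dimensional vectors; the vectors and the `cosine_to` helper are hypothetical and are not taken from the trained model.

```r
# Hand-picked toy vectors (hypothetical, for illustration only)
vecs <- rbind(
  paris   = c(1.0, 0.0, 1.0),
  france  = c(1.0, 0.0, 0.0),
  germany = c(0.0, 1.0, 0.0),
  berlin  = c(0.0, 1.0, 1.0),
  munich  = c(0.0, 1.0, 0.5),
  rome    = c(0.5, 0.5, 1.0)
)

# Cosine similarity between each row of m and the vector q
cosine_to <- function(m, q) {
  drop(m %*% q) / (sqrt(rowSums(m^2)) * sqrt(sum(q^2)))
}

query_words <- c("paris", "france", "germany")
query <- vecs["paris", ] - vecs["france", ] + vecs["germany", ]

sims <- cosine_to(vecs, query)
# Exclude the words used to build the query, then rank the rest
sims <- sims[!names(sims) %in% query_words]
ranked <- sort(sims, decreasing = TRUE)
print(ranked)
```

Applied to the real result, the same filter would remove paris and germany from the list and leave berlin, munich, and vienna at the top.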