The R "text2vec" package

Trying out the R "text2vec" package on three short example sentences.

library(text2vec)
a <- c("1:He loves her.", "2:She loves her.", "3:Her friends loved her")

【1】Extracting words

# 2nd argument = preprocessor (lower-casing), 3rd argument = tokenizer
it <- itoken(a, tolower, word_tokenizer)
(voc <- create_vocabulary(it))
Number of docs: 3 
0 stopwords:  ... 
ngram_min = 1; ngram_max = 1 
Vocabulary: 
      term term_count doc_count
1:       1          1         1
2: friends          1         1
3:       2          1         1
4:      he          1         1
5:   loved          1         1
6:     she          1         1
7:       3          1         1
8:   loves          2         2
9:     her          4         3
  • "term_count" is the number of times the term occurs across all documents
  • "doc_count" is the number of documents that contain the term
> class(voc)
[1] "text2vec_vocabulary" "data.frame"         
> dim(voc)
[1] 9 3
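
To make the difference between "term_count" and "doc_count" concrete, here is a small base-R sketch that recomputes both by hand. The `tokenize` helper is my own crude stand-in (lower-case, then split on anything that is not a letter or digit); it is not text2vec's `word_tokenizer`, so treat it as an assumption that happens to give the same tokens for these three sentences.

```r
# Base-R sketch: recompute term_count and doc_count by hand.
docs <- c("1:He loves her.", "2:She loves her.", "3:Her friends loved her")

# Hypothetical tokenizer (NOT text2vec's word_tokenizer):
# lower-case, split on non-alphanumerics, drop empty strings.
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z0-9]+")
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

tokens <- tokenize(docs)

# term_count: every occurrence counts, across all documents
term_count <- table(unlist(tokens))

# doc_count: each document counts at most once per term
doc_count <- table(unlist(lapply(tokens, unique)))

counts <- data.frame(
  term       = names(term_count),
  term_count = as.integer(term_count),
  doc_count  = as.integer(doc_count[names(term_count)]),
  stringsAsFactors = FALSE
)
print(counts)
```

"her" appears twice in the third sentence, which is why its term_count (4) is larger than its doc_count (3).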

【1'】Using stopwords from the "tm" package

library(tm)
it <- itoken(a, tolower, word_tokenizer)
(voc <- create_vocabulary(it, stopwords = stopwords("en")))
Number of docs: 3 
174 stopwords: i, me, my, myself, we, our ... 
ngram_min = 1; ngram_max = 1 
Vocabulary: 
      term term_count doc_count
1:       1          1         1
2: friends          1         1
3:       2          1         1
4:   loved          1         1
5:       3          1         1
6:   loves          2         2
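
Conceptually, stop word removal just drops every token that appears in a given word list. The sketch below shows that idea in base R. The list `my_stopwords` is a hypothetical three-word list written for this example, far shorter than the 174 entries of `stopwords("en")`, and the token vector is typed in by hand.

```r
# Hypothetical, hand-written stop word list (an assumption for illustration;
# tm's stopwords("en") is much longer).
my_stopwords <- c("he", "she", "her")

# Lower-cased tokens of the three example sentences, typed in by hand.
tokens <- c("1", "he", "loves", "her",
            "2", "she", "loves", "her",
            "3", "her", "friends", "loved", "her")

# Keep only the tokens that are not in the stop word list.
kept <- tokens[!tokens %in% my_stopwords]
print(table(kept))
```

The surviving terms are the same six that remain in the vocabulary above.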

【1''】Using stemDocument from the "tm" package

a <- stemDocument(a)
it <- itoken(a, tolower, word_tokenizer)
(voc <- create_vocabulary(it, stopwords = stopwords("en")))
Number of docs: 3 
174 stopwords: i, me, my, myself, we, our ... 
ngram_min = 1; ngram_max = 1 
Vocabulary: 
     term term_count doc_count
1:      1          1         1
2: friend          1         1
3:      2          1         1
4:      3          1         1
5:   love          3         3
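
To see why "loves" and "loved" collapse into a single term, it can help to stem a few words directly. The sketch below assumes the SnowballC package is installed (tm's stemDocument relies on it) and calls its `wordStem` function on individual words rather than on whole sentences.

```r
# Assumes the SnowballC package is available.
library(SnowballC)

words <- c("loves", "loved", "friends")

# Reduce each word to its stem with the English Snowball stemmer.
stems <- wordStem(words, language = "english")
print(setNames(stems, words))
```

Both verb forms map to "love", so their counts are merged (2 + 1 = 3), and "friends" becomes "friend".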

【1'''】Using removeNumbers from the "tm" package

# Stem first (this should change nothing if a was already stemmed in 【1''】)
a <- stemDocument(a)
a <- removeNumbers(a)
it <- itoken(a, tolower, word_tokenizer)
(voc <- create_vocabulary(it, stopwords = stopwords("en")))
Number of docs: 3 
174 stopwords: i, me, my, myself, we, our ... 
ngram_min = 1; ngram_max = 1 
Vocabulary: 
     term term_count doc_count
1: friend          1         1
2:   love          3         3
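
Removing numbers amounts to deleting digit characters from the raw strings before tokenizing. The base-R line below is a rough approximation of that step; whether it matches removeNumbers exactly in every case is an assumption on my part.

```r
docs <- c("1:He loves her.", "2:She loves her.", "3:Her friends loved her")

# Delete every run of digits; everything else (including the colon) stays.
no_digits <- gsub("[[:digit:]]+", "", docs)
print(no_digits)
```

The leftover ":" is punctuation, so the word tokenizer discards it and only "friend" and "love" remain after stemming and stop word removal.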

【2】Creating a DTM (Document-Term Matrix)

(dtm <- create_dtm(it, vocab_vectorizer(voc)))
3 x 2 sparse Matrix of class "dgCMatrix"
  friend love
1      .    1
2      .    1
3      1    1
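
A DTM has one row per document and one column per vocabulary term, and each cell holds how often that term occurs in that document. The base-R sketch below builds the same 3 x 2 matrix by hand as a dense matrix. The token list is typed in to match the vocabulary above; this is only an illustration of the layout, not how text2vec builds its sparse matrix internally.

```r
# Tokens left in each document after stemming, number removal and
# stop word removal (typed in by hand for this example).
tokens <- list("1" = "love", "2" = "love", "3" = c("friend", "love"))
vocab  <- c("friend", "love")

# Count each vocabulary term per document; vapply returns terms x documents,
# so transpose to get documents x terms.
dtm_dense <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm_dense) <- vocab
print(dtm_dense)
```

The "." entries in the dgCMatrix output correspond to the zeros here: the sparse format simply does not store them.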