今さら言語処理100本ノック #04

夏目漱石の小説『吾輩は猫である』の文章（[neko.txt](https://nlp100.github.io/data/neko.txt）を MeCab を使って形態素解析し，その結果を neko.txt.mecab というファイルに保存せよ．このファイルを用いて，以下の問に対応するプログラムを実装せよ．

なお，問題 37, 38, 39 は matplotlib もしくは Gnuplot を用いるとよい．

環境

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.15.4
BuildVersion:   19E266

$ python3 -V
Python 3.6.8

mecab

$ mecab --version
mecab of 0.996

$ more /usr/local/etc/mecabrc
;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
;dicdir =  /usr/local/lib/mecab/dic/ipadic
dicdir = /usr/local/lib/mecab/dic/mecab-ipadic-neologd
userdic = /path/to/user.dic

; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n

memo

mecab を通すとsurface\tpos,pos1,pos2,pos3,conjugation,form,base,read,spell\nの形に解析結果が出力される

久しぶりに mecab を触ったら以下のエラーが出た。どうやら昔定義したユーザー辞書が見つからないらしい。

なので mecab の設定ファイル(/usr/local/etc/mecabrc(Mac OS X))からuserdic = /path/to/user.dicの 1 行を消してあげれば問題なく動く。

$ mecab
viterbi.cpp(50) [tokenizer_->open(param)] tokenizer.cpp(127) [d->open(dicfile[i])] dictionary.cpp(79) [dmmap_->open(file, mode)] no such file or directory: /path/to/user.dic

30. 形態素解析結果の読み込み)

形態素解析結果（neko.txt.mecab）を読み込むプログラムを実装せよ．ただし，各形態素は表層形（surface），基本形（base），品詞（pos），品詞細分類 1（pos1）をキーとするマッピング型に格納し，1 文を形態素（マッピング型）のリストとして表現せよ．第 4 章の残りの問題では，ここで作ったプログラムを活用せよ．

Download neko.txt

$ curl -Os https://nlp100.github.io/data/neko.txt
$ :> neko.txt.mecab && mecab < neko.txt > neko.txt.mecab
$ more neko.txt.mecab
一      名詞,数,*,*,*,*,一,イチ,イチ
EOS
EOS
　      記号,空白,*,*,*,*,　,　,　
吾輩は猫である  名詞,固有名詞,一般,*,*,*,吾輩は猫である,ワガハイハネコデアル,ワ
ガハイワネコデアル
。      記号,句点,*,*,*,*,。,。,。
EOS
名前    名詞,一般,*,*,*,*,名前,ナマエ,ナマエ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
まだ    副詞,助詞類接続,*,*,*,*,まだ,マダ,マダ
無い    形容詞,自立,*,*,形容詞・アウオ段,基本形,無い,ナイ,ナイ
。      記号,句点,*,*,*,*,。,。,。

import MeCab

def read_mecab(file_name):
    """
    Args:
      file_name (str): path to txt file.

    Returns:
      Generator ([[word_shape]]): 一文毎のword_shapeの配列が返される.
      word_shape (dict): keys=["surface", "base", "pos", "pos1"]
    """
    with open(file_name) as f:
        lines = f.readlines()

    sentence = []
    for line in lines:
        if line == "EOS\n":
            if sentence:
                yield sentence
            sentence = []
            continue

        surface, attr = line.rstrip("\n").split("\t")
        cols = attr.split(",")
        sentence.append({
            'surface': surface,
            'base': cols[6],
            'pos': cols[0],
            'pos1': cols[1]
        })


wss  = list(read_mecab("neko.txt.mecab"))
print(*wss[:3], sep="\n")

[{'surface': '一', 'base': '一', 'pos': '名詞', 'pos1': '数'}]
[{'surface': '\u3000', 'base': '\u3000', 'pos': '記号', 'pos1': '空白'}, {'surface': '吾輩は猫である', 'base': '吾輩は猫である', 'pos': '名詞', 'pos1': '固有名詞'}, {'surface': '。', 'base': '。', 'pos': '記号', 'pos1': '句点'}]
[{'surface': '名前', 'base': '名前', 'pos': '名詞', 'pos1': '一般'}, {'surface': 'は', 'base': 'は', 'pos': '助詞', 'pos1': '係助詞'}, {'surface': 'まだ', 'base': 'まだ', 'pos': '副詞', 'pos1': '助詞類接続'}, {'surface': '無い', 'base': '無い', 'pos': '形容詞', 'pos1': '自立'}, {'surface': '。', 'base': '。', 'pos': '記号', 'pos1': '句点'}]

functions

今回使うプログラムに対応した関数群.

def filter_(sentences, *, f=lambda x: x, target="", flatten=False):
    """辞書型の配列を内包する配列に特化したフィルタ"""
    out = []
    for sentence in sentences:
        filtered = list(filter(f, sentence))
        if target:
            filtered = list(map(lambda x: x.get(target), filtered))
        if flatten:
            out+=filtered
        else:
            out.append(filtered)
    return out

def frequency(sentences):
    """辞書型の配列を内包する配列の単語出現頻度を返す"""
    dic = dict()
    for sentence in sentences:
        for word in sentence:
            surface = word.get("surface")
            if dic.get(surface):
                dic[surface] += 1
            else:
                dic[surface] = 1
    return dic

def top_n(frequency, n=10):
    """単語出現頻度をソートし, 上位n件を返す"""
    return sorted(frequency.items(), key=lambda x: x[1], reverse=True)[:n]

def wc_plot(freq):
    """出現頻度を棒グラフでプロット"""
    plt.bar(range(1, len(freq)+1), list(dict(freq).values()))
    plt.show()

31. 動詞

動詞の表層形をすべて抽出せよ．

verb_surfaces = filter_(wss, f=lambda x: x.get("pos") == "動詞", target="surface", flatten=True)
print(verb_surfaces) #> ['生れ', 'つか', '泣い', 'し', 'いる', '始め', '見', '聞く', ...

32. 動詞の原形

動詞の原形をすべて抽出せよ

verb_bases = filter_(wss, f=lambda x: x.get("pos") == "動詞", target="base", flatten=True)
print(verb_bases) #> ['生れる', 'つく', '泣く', 'する', 'いる', '始める', '見る', '聞く', ...

33. 「A の B」

2 つの名詞が「の」で連結されている名詞句を抽出せよ．

memo

文意的に連体修飾格の「の」でありそうだが, trigram で実装する

参考: のの意味

for ws in wss:
    for i in range(len(ws) - 2):
        n1, no, n2 = ws[i:i+3]
        if no.get("surface") == "の" and\
            n1.get("pos") == "名詞" and\
            n2.get("pos") == "名詞":

            print(f"{n1.get('surface')}の{n2.get('surface')}")

彼の掌
掌の上
書生の顔
はずの顔
顔の真中
穴の中
書生の掌
...

34. 名詞の連接

名詞の連接（連続して出現する名詞）を最長一致で抽出せよ．

def norn_series(word_shapes):
    words = []
    for ws in word_shapes:
        for e in ws:
            if e.get("pos") == "名詞":
                words.append(e.get("surface"))
            else:
                if len(words) > 1:
                    yield "".join(words)
                words = []
ns = list(norn_series(wss))
print(ns) #> ['した所', '人間中', '一番獰悪', '時妙', '一毛', 'その後猫', '一度', ...

35. 単語の出現頻度

foa = frequency(wss)
print(*top_n(foa, 10), sep="\n")

('の', 9101)
('。', 7484)
('、', 6772)
('て', 6697)
('は', 6384)
('に', 6145)
('を', 6068)
('と', 5474)
('が', 5259)
('た', 3916)

品詞毎に分けずに出すので当然助詞が上位を占めている

36. 頻度上位 10 語

出現頻度が高い 10 語とその出現頻度をグラフ（例えば棒グラフなど）で表示せよ．

memo

せっかくなので名詞で出現頻度が高い 10 語とその出現頻度をグラフにする

import matplotlib.pyplot as plt

foa = top_n(frequency(filter_(wss, f=lambda x: x.get("pos")=="名詞")), 10)
print(list(map(lambda x: x[0], foa)))

wc_plot(foa)

37. 「猫」と共起頻度の高い上位 10 語

「猫」とよく共起する（共起頻度が高い）10 語とその出現頻度をグラフ（例えば棒グラフなど）で表示せよ．

memo

共起の定義が不明瞭だが、今回は「猫」と同一文中に出現する物を共起すると定義する
せっかくなので動詞で出現頻度が高い 10 語とその出現頻度をグラフにする

cat_rerations = []
for ws in wss:
    for e in ws:
        if "猫" in  e.get("surface"):
            cat_rerations.append(ws)
            break
cat_rerations = top_n(frequency(filter_(cat_rerations, f=lambda x: x.get("pos")=="動詞")), 10)
print(cat_rerations)
wc_plot(cat_rerations)

38. ヒストグラム

単語の出現頻度のヒストグラム（横軸に出現頻度，縦軸に出現頻度をとる単語の種類数を棒グラフで表したもの）を描け．

arr = list(dict(sorted(foa.items(), key=lambda x: x[1], reverse=True)).values())
plt.hist(arr, bins=100, range=(0, 100))
plt.show()

39. Zipf の法則

plt.plot(range(1, len(arr)+1), arr,marker='.')
plt.xscale('log')
plt.yscale('log')
plt.show()