今さら言語処理100本ノック #02-python

popular-names.txtは，アメリカで生まれた赤ちゃんの「名前」「性別」「人数」「年」をタブ区切り形式で格納したファイルである．以下の処理を行うプログラムを作成し，popular-names.txtを入力ファイルとして実行せよ．さらに，同様の処理を UNIX コマンドでも実行し，プログラムの実行結果を確認せよ．

環境

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.15.4
BuildVersion:   19E266

$ python3 -V
Python 3.6.8

popular-names.txt

$ head -n 5 popular-names.txt
Mary    F       7065    1880
Anna    F       2604    1880
Emma    F       2003    1880
Elizabeth       F       1939    1880
Minnie  F       1746    1880

10. 行数のカウント

行数をカウントせよ．確認には wc コマンドを用いよ．

with open("popular-names.txt") as f:
    print(len(f.readlines()))

>> 2780

11. タブをスペースに置換

タブ 1 文字につきスペース 1 文字に置換せよ．確認には sed コマンド，tr コマンド，もしくは expand コマンドを用いよ．

with open("popular-names.txt") as f:
    text = f.read()

replaced_text = text.replace("\t", " ")
print(replaced_text)

12. 1 列目を col1.txt に，2 列目を col2.txt に保存 Permalink

各行の 1 列目だけを抜き出したものを col1.txt に，2 列目だけを抜き出したものを col2.txt としてファイルに保存せよ．確認には cut コマンドを用いよ．

with open("popular-names.txt") as f:
    lines = f.readlines()

col1 = list(map(lambda x: x.split("\t")[0], lines))
col2 = list(map(lambda x: x.split("\t")[1], lines))

with open("col1-py.txt", "w") as f:
    f.write("\n".join(col1) + "\n")

with open("col2-py.txt", "w") as f:
    f.write("\n".join(col2) + "\n")

13. col1.txt と col2.txt をマージ

12 で作った col1.txt と col2.txt を結合し，元のファイルの 1 列目と 2 列目をタブ区切りで並べたテキストファイルを作成せよ．確認には paste コマンドを用いよ．

with open("col1-py.txt") as f:
    col1 = f.read().split("\n")

with open("col2-py.txt") as f:
    col2 = f.read().split("\n")

lines = [f"{x}\t{y}" for x, y in zip(col1, col2)]
with open("cols-py.txt", "w") as f:
    f.write("\n".join(lines))

14. 先頭から N 行を出力

自然数 N をコマンドライン引数などの手段で受け取り，入力のうち先頭の N 行だけを表示せよ．確認には head コマンドを用いよ．

N = int(input())
with open("popular-names.txt") as f:
    lines = f.readlines()
    print(*lines[:N], sep="")

15. 末尾の N 行を出力

自然数 N をコマンドライン引数などの手段で受け取り，入力のうち末尾の N 行だけを表示せよ．確認には tail コマンドを用いよ．

N = int(input())
with open("popular-names.txt") as f:
    lines = f.readlines()
    print(*lines[-N:], sep="")

16. ファイルを N 分割する

自然数 N をコマンドライン引数などの手段で受け取り，入力のファイルを行単位で N 分割せよ．同様の処理を split コマンドで実現せよ．

import numpy as np

with open("popular-names.txt") as f:
    lines = f.readlines()

N = int(input())
idx = list(map(int, np.linspace(0, len(lines), N + 1)))
files = [lines[idx[i]:idx[i+1]] for i in range(N)]
print(*files, sep="\n")

memo

bash の方とは違う実装をしている
小数点の扱いがざる

17. １列目の文字列の異なり

1 列目の文字列の種類（異なる文字列の集合）を求めよ．確認には cut, sort, uniq コマンドを用いよ．

with open("popular-names.txt") as f:
    lines = f.readlines()

uniq_names = set([line.split("\t")[0] for line in lines])
print(*sorted(uniq_names), sep="\n")

18. 各行を 3 コラム目の数値の降順にソート

各行を 3 コラム目の数値の逆順で整列せよ（注意: 各行の内容は変更せずに並び替えよ）．確認には sort コマンドを用いよ（この問題はコマンドで実行した時の結果と合わなくてもよい）．

with open("popular-names.txt") as f:
    lines = f.readlines()

sorted_lines = sorted(lines, key=lambda x: int(x.split("\t")[2]), reverse=True)
print(*sorted_lines, sep="")

19. 各行の 1 コラム目の文字列の出現頻度を求め，出現頻度の高い順に並べる

各行の 1 列目の文字列の出現頻度を求め，その高い順に並べて表示せよ．確認には cut, uniq, sort コマンドを用いよ．

with open("popular-names.txt") as f:
    lines = f.readlines()

col1_count = dict()
for l in lines:
    c = l.split("\t")[0]
    if col1_count.get(c):
        col1_count[c]+=1
    else:
        col1_count[c] = 1
sorted_col1 = sorted(col1_count.items(), key=lambda x: x[1], reverse=True)
print(*sorted_col1, sep="\n")

memo