今さら言語処理100本ノック #02-bash

popular-names.txtは，アメリカで生まれた赤ちゃんの「名前」「性別」「人数」「年」をタブ区切り形式で格納したファイルである．以下の処理を行うプログラムを作成し，popular-names.txtを入力ファイルとして実行せよ．さらに，同様の処理を UNIX コマンドでも実行し，プログラムの実行結果を確認せよ．

環境

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.15.4
BuildVersion:   19E266

$ sh --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin19)
Copyright (C) 2007 Free Software Foundation, Inc.

popular-names.txt

$ head -n 5 popular-names.txt
Mary    F       7065    1880
Anna    F       2604    1880
Emma    F       2003    1880
Elizabeth       F       1939    1880
Minnie  F       1746    1880

10. 行数のカウント

行数をカウントせよ．確認には wc コマンドを用いよ．

$ wc -l popular-names.txt
    2780 popular-names.txt

$ more popular-names.txt | wc -l
    2780

参考

【 wc 】コマンド――テキストファイルの文字数や行数を数える

11. タブをスペースに置換

タブ 1 文字につきスペース 1 文字に置換せよ．確認には sed コマンド，tr コマンド，もしくは expand コマンドを用いよ．

sed -e 's/[[:cntrl:]]/\ /g' popular-names.txt > popular-names-11.txt

cat -t popular-names.txt | sed -e "s/\^I/\ /g" > popular-names-11.txt

tr "\t" "\ " < popular-names.txt > popular-names-11.txt

cat popular-names.txt | tr "\t" "\ " > popular-names-11.txt

expand -t 1 popular-names.txt > popular-names-11.txt

popular-names-11.txt

$ head -n 5 popular-names-11.txt
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880

memo

mac の sed でタブを置換できないのでつまづき
cat の–show-tabs オプションでタブを置換してからサイド置換が個人的には良さげ
gsed コマンドをインストールするのもいいと思われる

参考

mac OSX で sed を使って tab の文字列を置換【 tr 】コマンド――テキストファイルの文字を置換する／削除する expand コマンドについて詳しくまとめました【Linux コマンド集】

12. 1 列目を col1.txt に，2 列目を col2.txt に保存 Permalink

各行の 1 列目だけを抜き出したものを col1.txt に，2 列目だけを抜き出したものを col2.txt としてファイルに保存せよ．確認には cut コマンドを用いよ．

$ cut -f 1 popular-names.txt > col1.txt
$ cut -f 2 popular-names.txt > col2.txt

col1.txt, col2.txt

$ head -n 5 col1.txt
Mary
Anna
Emma
Elizabeth
Minnie

$ head -n 5 col2.txt
F
F
F
F
F

memo

とりあえず cut コマンドを動かしてみたらうまくいった

参考

CSV ファイルの特定の列を取り出す

13. col1.txt と col2.txt をマージ

12 で作った col1.txt と col2.txt を結合し，元のファイルの 1 列目と 2 列目をタブ区切りで並べたテキストファイルを作成せよ．確認には paste コマンドを用いよ．

$ paste col1.txt col2.txt > cols.txt

cols.txt

$ head -n 5 cols.txt
Mary    F
Anna    F
Emma    F
Elizabeth       F
Minnie  F

14. 先頭から N 行を出力

自然数 N をコマンドライン引数などの手段で受け取り，入力のうち先頭の N 行だけを表示せよ．確認には head コマンドを用いよ．

# デバックで使っていたので割愛

15. 末尾の N 行を出力

自然数 N をコマンドライン引数などの手段で受け取り，入力のうち末尾の N 行だけを表示せよ．確認には tail コマンドを用いよ．

$ tail -n 5 popular-names.txt
Benjamin        M       13381   2018
Elijah  M       12886   2018
Lucas   M       12585   2018
Mason   M       12435   2018
Logan   M       12352   2018

16. ファイルを N 分割する

自然数 N をコマンドライン引数などの手段で受け取り，入力のファイルを行単位で N 分割せよ．同様の処理を split コマンドで実現せよ．

$ split -l  $(expr $(expr `cat popular-names.txt | wc -l` / 3) + 1) popular-names.txt  popular-names-

$ read chunks && split -l  $(expr $(expr `cat popular-names.txt | wc -l` / $chunks) + 1) popular-names.txt  popular-names-

memo

mac OSX の仕様?で split に n オプションがないのでむりくり実装
行数 % n == 0のときの要素数が違ってくる

$ split -h
split: illegal option -- h
usage: split [-a sufflen] [-b byte_count] [-l line_count] [-p pattern]
             [file [prefix]]

参考

How to split a large text file into smaller files with equal number of lines?

17. １列目の文字列の異なり

1 列目の文字列の種類（異なる文字列の集合）を求めよ．確認には cut, sort, uniq コマンドを用いよ．

$ cut -f 1 popular-names.txt | sort | uniq | head -n 5
Abigail
Aiden
Alexander
Alexis
Alice

18. 各行を 3 コラム目の数値の降順にソート

各行を 3 コラム目の数値の逆順で整列せよ（注意: 各行の内容は変更せずに並び替えよ）．確認には sort コマンドを用いよ（この問題はコマンドで実行した時の結果と合わなくてもよい）．

$ cat -t popular-names.txt | sed -e "s/\^I/\ /g" | sort -r -n -t " " -k 3 | head -n 5
Linda F 99689 1947
Linda F 96211 1948
James M 94757 1947
Michael M 92704 1957
Robert M 91640 1947

memo

r リバース
n 数字順を明示
t 区切り文字
k 列番号

参考

sort コマンドで CSV ファイルをソートする場合はソート列の指定方法に注意

19. 各行の 1 コラム目の文字列の出現頻度を求め，出現頻度の高い順に並べる

各行の 1 列目の文字列の出現頻度を求め，その高い順に並べて表示せよ．確認には cut, uniq, sort コマンドを用いよ．

$ cut -f 1 popular-names.txt | sort | uniq -c | sort -r | head -n 5
 118 James
 111 William
 108 Robert
 108 John
  92 Mary