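When processing text data, a common situation is that the fields (dimensions) of a single Record are spread across several adjacent lines instead of sitting on one line. Records are separated by an identical marker line, such as ... or ,,,. On GNU/Linux, awk and sed can merge these lines so that each Record occupies exactly one line. This post documents the commands and explains how they work.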

System Info

Operating system information

Item      Details
OS        Debian GNU/Linux 9.5 (stretch)
Kernel    4.9.0-7-amd64

Software information

Software  Version
awk       GNU Awk 4.1.4
sed       sed (GNU sed) 4.4

Note: on *nix systems, the default line terminator is \n

Data Preparation

Prepare the test data by creating the file /tmp/data.txt with the following content:

1
2
3
4
...
aa
bb
cc
...
Mon
Tue
Wed
...
Jan
Feb
Mar
Asia
Africa
Europe
...
192.168.1.1
192.168.1.2
192.168.1.3
192.168.1.4
...
AxdLog
is my
personal
blog
...
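
One way to create the file is a here-document. This is only a sketch: the variable name DATA_FILE is an assumption introduced here so the path can be overridden, and the trailing whitespace that some original lines carry is not reproduced.

```shell
# DATA_FILE is a hypothetical variable; it defaults to the path used in this post.
DATA_FILE="${DATA_FILE:-/tmp/data.txt}"

# The quoted 'EOF' keeps the shell from expanding anything inside the block.
cat > "$DATA_FILE" <<'EOF'
1
2
3
4
...
aa
bb
cc
...
Mon
Tue
Wed
...
Jan
Feb
Mar
Asia
Africa
Europe
...
192.168.1.1
192.168.1.2
192.168.1.3
192.168.1.4
...
AxdLog
is my
personal
blog
...
EOF

# Quick sanity check: total lines and number of separator lines.
echo "lines: $(wc -l < "$DATA_FILE"), separators: $(grep -c '^[.]' "$DATA_FILE")"
```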

The ... lines mark the boundaries between Records. The goal is to convert the data into the following format:

1 2 3 4
aa bb cc
Mon Tue Wed
Jan Feb Mar Asia Africa Europe
192.168.1.1 192.168.1.2 192.168.1.3 192.168.1.4
AxdLog is my personal blog

Via awk

This section implements the requirement with awk.

Command

The command is as follows:

awk '{if($0!~/^[.]+/){ORS=" ";print $0}else{printf "\n"}}' /tmp/data.txt

Operation Procedure

The session below shows the command with three different separators and the output of each run:

$ awk '{ if($0!~/^[.]+/){ORS=" ";print $0}else{printf "\n"} }' /tmp/data.txt
1 2 3 4
aa bb cc
Mon Tue Wed
Jan Feb Mar Asia Africa Europe
192.168.1.1 192.168.1.2 192.168.1.3 192.168.1.4
AxdLog is my personal blog
$ awk '{ if($0!~/^[.]+/){ORS=",";print $0}else{printf "\n"} }' /tmp/data.txt
1,2,3,4,
aa,bb,cc,
Mon,Tue,Wed,
Jan,Feb,Mar,Asia,Africa,Europe,
192.168.1.1,192.168.1.2,192.168.1.3,192.168.1.4,
AxdLog,is my,personal,blog ,
$ awk '{ if($0!~/^[.]+/){ORS="|";print $0}else{printf "\n"} }' /tmp/data.txt
1|2|3|4|
aa|bb|cc|
Mon|Tue|Wed|
Jan|Feb|Mar|Asia|Africa|Europe|
192.168.1.1|192.168.1.2|192.168.1.3|192.168.1.4|
AxdLog|is my|personal|blog |

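As shown, the command works with different separators. The benefit is that fields containing spaces stay distinguishable, as in AxdLog|is my|personal|blog |, whereas the space-joined AxdLog is my personal blog gives no way to tell where one field ends and the next begins.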
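
The difference can be checked by counting fields. The snippet below is a small illustration, not part of the original commands; the variable names are made up for the example.

```shell
# The same Record joined with "|" and with a space.
piped='AxdLog|is my|personal|blog'
spaced='AxdLog is my personal blog'

# With "|" as the field separator, "is my" stays a single field.
piped_fields=$(printf '%s\n' "$piped" | awk -F'|' '{ print NF }')

# With awk's default whitespace splitting, the original boundaries are lost.
spaced_fields=$(printf '%s\n' "$spaced" | awk '{ print NF }')

echo "pipe-separated:  $piped_fields fields"
echo "space-separated: $spaced_fields fields"
```

The pipe-separated form reports four fields, matching the four input lines of the Record, while the space-separated form reports five.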

To remove the trailing | or , from each line, pipe the output through sed, for example:

$ awk '{ if($0!~/^[.]+/){ORS=",";print $0}else{printf "\n"} }' /tmp/data.txt | sed -r 's@,$@@g'
1,2,3,4
aa,bb,cc
Mon,Tue,Wed
Jan,Feb,Mar,Asia,Africa,Europe
192.168.1.1,192.168.1.2,192.168.1.3,192.168.1.4
AxdLog,is my,personal,blog
$ awk '{ if($0!~/^[.]+/){ORS="|";print $0}else{printf "\n"} }' /tmp/data.txt | sed -r 's@\|$@@g'
1|2|3|4
aa|bb|cc
Mon|Tue|Wed
Jan|Feb|Mar|Asia|Africa|Europe
192.168.1.1|192.168.1.2|192.168.1.3|192.168.1.4
AxdLog|is my|personal|blog
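
As an alternative to the extra sed step, the Record can be built up in an awk variable so that no trailing separator is ever printed. This is a sketch of a different technique from the ORS-based command above; the function name join_records and the variable names rec, sep and n are assumptions made for the example.

```shell
# join_records SEP FILE
# Hypothetical helper: joins the lines of each Record with SEP, one Record per line.
join_records() {
  awk -v sep="$1" '
    /^[.]+/ { print rec; rec = ""; n = 0; next }  # separator line: flush the Record
            { rec = (n++ ? rec sep : "") $0 }     # data line: append, separator only between fields
    END     { if (n) print rec }                  # flush a last Record that has no closing separator
  ' "$2"
}

# Usage on a small sample file.
sample=$(mktemp)
printf '1\n2\n...\nis my\nblog\n...\n' > "$sample"
join_records '|' "$sample"
rm -f "$sample"
```

Because the separator is only inserted between two fields, this variant also copes with a final Record that is not followed by a ... line.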

Explanation

Explanation of the command:

awk '{ if($0!~/^[.]+/){ORS=" ";print $0}else{printf "\n"} }' /tmp/data.txt
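  1. The if statement performs the conditional test; it must be enclosed in '{ }', which forms the awk action block;
  2. $0 is the whole current line, ~ is the pattern-match operator, and ! negates it, so !~ means "does not match";
  3. /^[.]+/ is a regular expression that matches a line beginning with one or more periods (.); it identifies the lines that separate Records;
  4. ORS is the output record separator, \n by default; ORS=" " makes print terminate each record with a space (" ") instead of a newline;
  5. print $0 outputs the whole line, followed by ORS;
  6. printf "\n" outputs a newline \n; unlike print, printf does not append ORS.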
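
The two key pieces, the separator regex and ORS, can be tried in isolation. The inputs below are invented for the demonstration.

```shell
# Which lines does /^[.]+/ treat as separators? Any line that starts with a period,
# so ".hidden" matches too, while "192.168.1.1" does not.
matched=$(printf '...\n.hidden\n192.168.1.1\naa\n' | awk '$0 ~ /^[.]+/')
printf '%s\n' "$matched"

# ORS replaces the newline that print would normally append to each record.
joined=$(printf 'aa\nbb\ncc\n' | awk 'BEGIN { ORS = "," } { print }')
printf '%s\n' "$joined"
```

The first command shows that the pattern is anchored only at the start of the line. If data lines could themselves begin with a period, a stricter pattern such as /^[.]+$/ would be a safer choice, provided the separator lines carry no trailing whitespace.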

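Put together, the command works as follows:

  • For each line, test whether it is a Record separator line:
    • If it is *not*, the line is a field of the current Record. The output record separator is changed from the default \n to a space (" ") and the line is printed, which joins the fields of the same Record;
    • If it *is*, the line is a Record separator and must be dropped. Its position is where one Record ends, so printf "\n" prints a bare newline \n there to start the next Record on a new line.

This produces the expected result.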
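
The same logic can be written with awk pattern-action rules instead of if/else, which some readers find easier to follow. This is a behavior-preserving sketch; the function name merge_lines and the variable sep are assumptions for the example.

```shell
# merge_lines SEP FILE
# Hypothetical wrapper around the one-liner: same output, including the trailing SEP.
merge_lines() {
  awk -v sep="$1" '
    BEGIN   { ORS = sep }           # every print now ends with SEP instead of a newline
    /^[.]+/ { printf "\n"; next }   # separator line: end the Record, print nothing else
            { print }               # data line: print it followed by ORS
  ' "$2"
}

# Usage on a small sample file.
sample=$(mktemp)
printf '1\n2\n...\naa\nbb\n...\n' > "$sample"
merge_lines ',' "$sample"
rm -f "$sample"
```

Setting ORS once in BEGIN avoids reassigning it on every data line, but the result is otherwise identical to the original command.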

Via sed

This section implements the same requirement with sed.

Command

The commands are as follows:

sed -r ':a;N;$!ba;s@\n@ @g;s@ [.]+[[:space:]]*@\n@g;' /tmp/data.txt

xargs -a /tmp/data.txt | sed -r 's@ [.]+[[:space:]]*@\n@g;'

The sed-only command is adapted from Command Line Magic on Twitter.

Operation Procedure

The session below shows the commands and their output:

$ xargs -a /tmp/data.txt | sed -r 's@ [.]+[[:space:]]*@\n@g;'
1 2 3 4
aa bb cc
Mon Tue Wed
Jan Feb Mar Asia Africa Europe
192.168.1.1 192.168.1.2 192.168.1.3 192.168.1.4
AxdLog is my personal blog

$ sed -r ':a;N;$!ba;s@\n@ @g;s@ [.]+[[:space:]]*@\n@g;' /tmp/data.txt
1 2 3 4
aa bb cc
Mon Tue Wed
Jan Feb Mar Asia Africa Europe
192.168.1.1 192.168.1.2 192.168.1.3 192.168.1.4
AxdLog is my personal blog

$ sed -r ':a;N;$!ba;s@\n@|@g;s@[|][.]+[|]*@\n@g;' /tmp/data.txt
1|2|3|4
aa|bb|cc
Mon|Tue|Wed
Jan|Feb|Mar|Asia|Africa|Europe
192.168.1.1|192.168.1.2|192.168.1.3|192.168.1.4
AxdLog|is my|personal|blog

$ sed -r ':a;N;$!ba;s@\n@,@g;s@[,][.]+[,]*@\n@g;' /tmp/data.txt
1,2,3,4
aa,bb,cc
Mon,Tue,Wed
Jan,Feb,Mar,Asia,Africa,Europe
192.168.1.1,192.168.1.2,192.168.1.3,192.168.1.4
AxdLog,is my,personal,blog


The sed approach leaves one extra blank line at the end of the output; it can be removed with another sed call:

$ sed -r ':a;N;$!ba;s@\n@,@g;s@[,][.]+[,]*@\n@g;' /tmp/data.txt | sed '/^$/d'
1,2,3,4
aa,bb,cc
Mon,Tue,Wed
Jan,Feb,Mar,Asia,Africa,Europe
192.168.1.1,192.168.1.2,192.168.1.3,192.168.1.4
AxdLog,is my,personal,blog
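
Note that /^$/d deletes every empty line, not only the last one. If empty lines inside the output need to survive, restricting the deletion to the last line is one option. The sketch below compares the two on an invented sample; the variable names are assumptions.

```shell
# Sample with two empty lines: one in the middle and one at the very end.
sample=$(mktemp)
printf 'a\n\nb\n\n' > "$sample"

# /^$/d removes every empty line, including the one in the middle.
all_removed=$(sed '/^$/d' "$sample" | wc -l)

# ${/^$/d;} applies the deletion only on the last line, so the inner empty line survives.
last_removed=$(sed '${/^$/d;}' "$sample" | wc -l)

echo "lines left with /^\$/d:      $all_removed"
echo "lines left with \${/^\$/d;}: $last_removed"
rm -f "$sample"
```

For the test data in this post both forms give the same result, because the only empty line is the final one.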

Explanation

The idea is similar to the awk approach. For a detailed analysis, see How can I replace a newline (\n) using sed?

Explanation of the command:

sed -r ':a;N;$!ba;s@\n@ @g;s@ [.]+[[:space:]]*@\n@g;' /tmp/data.txt
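  1. The -r option enables extended regular expressions;
  2. :a defines a label named a; the b command that appears later branches to a label;
  3. N appends the next line of input to the pattern space;
  4. $!ba means: if the current line is not the last one ($!), branch (b) to label a, so the loop keeps appending lines until the whole file is in the pattern space;
  5. s@\n@ @g; replaces each newline \n with a space; s is the substitute command, @ is used as the delimiter, and g is the global flag;
  6. s@ [.]+[[:space:]]*@\n@g; replaces each Record separator with a newline \n;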
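
The effect of the :a;N;$!ba loop is easiest to see by running the substitution with and without it. The example assumes GNU sed, as used throughout this post, and uses invented input.

```shell
# With the loop, all lines are collected in the pattern space first,
# so the embedded newlines are visible to the s command.
slurped=$(printf 'aa\nbb\ncc\n' | sed ':a;N;$!ba;s/\n/ /g')
printf '%s\n' "$slurped"

# Without the loop, sed handles one line at a time; the pattern space never
# contains a newline, so the substitution has nothing to replace.
unchanged=$(printf 'aa\nbb\ncc\n' | sed 's/\n/ /g')
printf '%s\n' "$unchanged"
```

The first command prints the three values on one line, while the second prints the input unchanged.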

n N Read/append the next line of input into the pattern space.

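Note: the xargs + sed approach has a serious limitation. xargs joins its input with spaces only, so it cannot be used when a different separator is needed.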
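
If a pipeline in the style of the xargs variant is still wanted with another separator, paste can do the joining instead of xargs. This swaps in a different tool and is only a sketch: the function name join_with is hypothetical, and the separator is assumed to be a single character that is safe inside a bracket expression and is not @.

```shell
# join_with SEP FILE
# Hypothetical helper: paste -s joins all lines with SEP, then sed turns each
# separator line (plus the SEP around it) into a newline and drops empty lines.
join_with() {
  paste -s -d "$1" "$2" | sed -r "s@[$1][.]+[$1]?@\n@g" | sed '/^$/d'
}

# Usage on a small sample file.
sample=$(mktemp)
printf '1\n2\n...\nis my\nblog\n...\n' > "$sample"
join_with '|' "$sample"
rm -f "$sample"
```

Unlike xargs, paste does not split or re-quote its input, so fields such as "is my" are passed through untouched.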

Tutorials

The following are tutorials related to sed.

Bibliography

Sed Tips and Tricks

The Geek Stuff has a tutorial series called Sed Tips and Tricks:

  1. Unix Sed Tutorial: Printing File Lines using Address and Patterns
  2. Unix Sed Tutorial: Delete File Lines Using Address and Patterns
  3. Unix Sed Tutorial: Find and Replace Text Inside a File Using RegEx
  4. Unix Sed Tutorial: How To Write to a File Using Sed
  5. Unix Sed Tutorial: How To Execute Multiple Sed Commands
  6. Unix Sed Tutorial: Multi-Line File Operation with 6 Practical Examples
  7. Unix Sed Tutorial: Append, Insert, Replace, and Count File Lines
  8. Unix Sed Tutorial : 7 Examples for Sed Hold and Pattern Buffer Operations
  9. Unix Sed Tutorial: Advanced Sed Substitution Examples
  10. Unix Sed Tutorial: 6 Examples for Sed Branching Operation

Change Logs

  • 2016.09.22 18:48 Thu Asia/Shanghai
    • Initial draft completed
  • 2016.11.23 17:05 Wed Asia/Shanghai
  • 2017.01.05 14:26 Thu Asia/Shanghai
    • Added an explanation of how sed labels are used
  • 2018.07.27 14:15:42 Fri America/Boston
    • Corrections; migrated to the new blog