awk

The awk language is powerful enough that entire books have been written about this one utility. Even so, you cannot talk about serious shell programming without talking about awk.

Named after its principal authors, Al Aho, Peter Weinberger, and Brian Kernighan, awk offers a more flexible way than grep to combine regular expressions with custom actions.

grep, as its name implies (Global Regular Expression Print), is good at printing either a line or the matching portion of a line when one or more regular expressions match. What it doesn't do so well is:

  • Print an unmatched portion of a line

  • Perform an action other than print

And without piping one invocation of grep into another, you cannot:

  • AND two or more regular expressions (only OR)

  • Include lines matching one regular expression while excluding lines matching another

awk, on the other hand, makes those things easy.
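As a rough illustration of the first two points, the sketch below prints only the part of a matching line that the regular expression did not match, and then performs an action other than print by summing a column. The input data (item names and counts) is made up for the example.

```shell
#!/bin/sh
# Hypothetical sample data: "<item> <count>" on each line.
data='apple 3
banana 5
apricot 4'

# Print only the second field of lines that start with "a",
# i.e. a portion of the line that the regular expression did not match.
counts=$(printf '%s\n' "$data" | awk '/^a/ { print $2 }')
printf '%s\n' "$counts"

# Perform an action other than print: add up the second field
# of the matching lines and report the total once at the end.
total=$(printf '%s\n' "$data" | awk '/^a/ { sum += $2 } END { print sum }')
printf '%s\n' "$total"
```

The first command prints 3 and 4 on separate lines; the second prints 7.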

Commonly when you need to find a line that matches more than one regular expression, you will see:

#!/bin/sh
echo abc | grep a | grep b | grep c

This produces the expected output, abc, but it is inefficient to send the line through three separate invocations of grep when a single invocation of awk achieves the same result:

#!/bin/sh
echo abc | awk '/a/ && /b/ && /c/'

An awk program is a series of rules of the form PREDICATE { ACTION }. If { ACTION } is omitted and PREDICATE evaluates to true (non-zero, or a non-empty string), the line is printed. This makes it easy to translate most grep commands into awk. In the above example, awk is told to print the line (the default { ACTION }) when the line contains at least one a, one b, and one c.
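The forms a rule can take may be easier to see side by side. The following is a minimal sketch using two made-up input lines; the variable names are only there to keep the three results apart.

```shell
#!/bin/sh
# Predicate only: the default action prints the whole line.
pattern_only=$(printf 'abc\nxyz\n' | awk '/b/')

# Predicate with an explicit action: print the line number and the line.
with_action=$(printf 'abc\nxyz\n' | awk '/z/ { print NR ": " $0 }')

# The predicate need not be a regular expression; any expression works.
# NR == 1 is true (non-zero) only for the first line of input.
first_line=$(printf 'abc\nxyz\n' | awk 'NR == 1')

printf '%s\n' "$pattern_only" "$with_action" "$first_line"
```

This prints abc, then 2: xyz, then abc again.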

Commonly when you need to find a line that matches one regular expression but not another, you will see:

#!/bin/sh
printf 'doghouse\nbirdhouse\n' | grep house | grep -v dog

This produces the expected output of birdhouse but is inefficient because it sends both lines to each grep when a single invocation of awk can process the stream once to produce the same results:

#!/bin/sh
printf 'doghouse\nbirdhouse\n' | awk '/house/ && !/dog/'
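The &&, || and ! operators combine freely, and parentheses group them as you would expect. As a small sketch with a few extra made-up input lines, this keeps lines that mention either animal but drops the ones about houses:

```shell
#!/bin/sh
# OR two expressions, then exclude a third, in a single pass over the input.
result=$(printf 'doghouse\nbirdhouse\nbirdcage\ndogsled\n' |
    awk '(/dog/ || /bird/) && !/house/')
printf '%s\n' "$result"
```

Only birdcage and dogsled are printed.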

This table will help you translate grep regular expression syntax to awk regex syntax:

Element              Portable grep     Extended grep    awk
-------------------  ----------------  ---------------  --------------
Grouping             \( and \)         ( and )          ( and )
Quantity 1 or more   \+ or \{1,\}      + or {1,}        +
Quantity 0 or 1      \? or \{0,1\}     ? or {0,1}       ?
Quantity N           \{N\}             {N}              Not portable *
Quantity N or less   \{,N\}            {,N}             Not portable *
OR                   \|                |                |
Word Bounding        \< and \>         \< and \>        Unsupported *

* Portable awk solution offered below.

The syntax of regular expressions in awk is most closely like that of egrep (or grep -E), except that numeric quantifiers are not portably supported beyond the basic + and ? for quantities "1 or more" and "0 or 1" respectively.
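To make the translation concrete, here is a minimal sketch that runs one pattern, "one or more ab followed by c", in all three dialects against the same made-up input. It assumes a grep that supports -E, as any POSIX grep does.

```shell
#!/bin/sh
input='ac
abc
ababc
abab'

# Portable (basic) grep: grouping and the interval need backslashes.
basic=$(printf '%s\n' "$input" | grep '^\(ab\)\{1,\}c$')

# Extended grep: the same pattern without the backslashes.
extended=$(printf '%s\n' "$input" | grep -E '^(ab)+c$')

# awk: identical to the extended form, wrapped in slashes.
awk_out=$(printf '%s\n' "$input" | awk '/^(ab)+c$/')

printf '%s\n' "$basic" "$extended" "$awk_out"
```

All three commands select the same two lines, abc and ababc.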

Although {N}, {,N}, {N,}, and {N,M} are unsupported in many flavors of awk, they can be implemented with a function that expands them into equivalent portable expressions.

#!/usr/bin/awk -f
BEGIN {
    d = "[[:digit:]]+"
    quantifier = sprintf("\\{(%s|%s,|,%s|%s,%s)\\}", d, d, d, d, d)
}

function quantify(item, qstr,    init, stop, curitem, itemlist, i)
{
    init = stop = 0
    if (qstr ~ "^" d "$") init = stop = qstr
    else
    {
        if (match(qstr, "^" d ",")) init = substr(qstr, 1, RLENGTH - 1)
        if (match(qstr, "," d "$")) stop = substr(qstr, RSTART + 1)
    }
    #
    # NB: init and stop hold strings when taken from qstr; add 0 when
    # comparing so that counts of 10 or more are compared as numbers
    #
    curitem = itemlist = ""
    for (i = 1; i <= init + 0; i++) curitem = curitem item
    if (!stop) itemlist = curitem itemlist "(" item ")+?"
    else for (i = init + 0; i <= stop + 0; i++)
    {
        itemlist = itemlist (itemlist ? "|" : "") curitem
        curitem = curitem item
    }
    return "(" itemlist ")" ((init + 0) ? "" : "?")
}

function quantify_chars(re,    done, head, char, tail, qstr, escaped)
{
    done = "" # text already processed; lets the loop reach every match
    while (match(re, "." quantifier))
    {
        head = substr(re, 1, RSTART - 1) # text before leading char
        char = substr(re, RSTART, 1) # matched character
        tail = substr(re, RSTART + RLENGTH) # text after quantifier
        qstr = substr(re, RSTART + 2, RLENGTH - 3) # braces pruned

        escaped = 0
        if (char == "\\")
        {
            #
            # Odd backslash[es]: `\{2}' not expanded
            # Even backslashes: `\\{2}' expanded
            #
            escaped = (match(head, /\\+$/) ? RLENGTH + 1 : 1) % 2
            if (!escaped)
            {
                head = substr(head, 1, length(head) - 1)
                char = char char
            }
        }
        else if (char == ")" || char == "]")
        {
            #
            # Odd backslash[es]: `\){2}' expanded
            # Even backslashes: `\\){2}' not expanded
            #
            escaped = (match(head, /\\+$/) ? RLENGTH + 1 : 1) % 2
            if (!escaped)
            {
                head = substr(head, 1, length(head) - 1)
                char = "\\" char
            }
        }

        if (escaped) done = done head char "{" qstr "}" # keep verbatim
        else done = done head quantify(char, qstr)
        re = tail # continue with the text after the quantifier
    }
    return done re
}

function quantify_spans(re, open_char, close_char,
    greedy, ungreedy, head, item, tail, qstr, escaped,
    n, subhead, subitem, subtail, rlen)
{
    greedy = sprintf("\\%c.*\\%c", open_char, close_char)
    ungreedy = sprintf("[^\\%c\\%c]*[\\%c\\%c]$",
        open_char, close_char, open_char, close_char)
    head = re
    while (match(head, greedy quantifier))
    {
        head = substr(re, 1, RSTART - 1) # text before open_char
        item = substr(re, RSTART, RLENGTH) # match with quantifier
        tail = substr(re, RSTART + RLENGTH) # text after quantifier
        match(item, quantifier "$") # determine quantifier length
        qstr = substr(item, RSTART + 1, RLENGTH - 2) # braces pruned
        item = substr(item, 1, RSTART - 1) # prune quantifier

        #
        # Odd backslash[es]: `[abc\]{2}' not expanded
        # Even backslashes: `[abc\\]{2}' expanded
        #
        escaped = 0
        if (match(item, "\\\\+\\" close_char "$"))
            escaped = (RLENGTH - 1) % 2
        if (escaped) continue

        #
        # Fixup greedy matches: `(abc)(123){2}' -> `(123){2}'
        # Handle unbalanced matches: `(abc)){2}' not expanded
        #
        n = 0
        subhead = item
        subtail = subitem = ""
        while (match(subhead, ungreedy))
        {
            subhead = substr(item, 1, RSTART - 1)
            subitem = substr(item, RSTART, rlen = RLENGTH)

            #
            # Odd backslash[es] (`\]'): ignore/skip character
            # Even backslashes (`\\]'): go on to increment level
            #
            escaped = 0
            if (match(subitem, sprintf("\\\\+(\\%s|\\%s)$",
                open_char, close_char)))
                escaped = (RLENGTH - 1) % 2
            if (escaped)
            {
                subtail = subitem subtail
                continue
            }

            #
            # We are processing right-to-left, so close_char at end
            # of subitem means increment level and open_char means
            # to decrement level
            #
            if (subitem ~ "\\" close_char "$")
            {
                n++
                subtail = subitem subtail
            }
            else if (subitem ~ "\\" open_char "$")
            {
                if (--n) subtail = subitem subtail
                else
                {
                    subhead = subhead \
                        substr(subitem, 1, rlen - 1)
                    subtail = open_char subtail
                }
            }
            if (!n) break
        }
        if (n) continue # open/close characters are unbalanced

        head = head subhead
        item = subtail
        if (item ~ /^\(.*\)$/) # prune parentheses
            item = substr(item, 2, length(item) - 2)
        re = head quantify(item, qstr) tail
    }
    return re
}

function qre(re)
{
    # (abc){2} -> (abcabc)
    # (abc){,2} -> (abc|abcabc)?
    # (abc){2,} -> (abcabc(abc)+?)
    # (abc){2,3} -> (abcabc|abcabcabc)
    re = quantify_spans(re, "(", ")")
    # NB: Do this first to eliminate extra work since later
    # expansions below may introduce additional parentheses

    # [0-9]{2} -> ([0-9][0-9])
    # [0-9]{,2} -> ([0-9]|[0-9][0-9])?
    # [0-9]{2,} -> ([0-9][0-9]([0-9])+?)
    # [0-9]{2,3} -> ([0-9][0-9]|[0-9][0-9][0-9])
    re = quantify_spans(re, "[", "]")

    # a{2} -> (aa)
    # a{,2} -> (a|aa)?
    # a{2,} -> (aa(a)+?)
    # a{2,3} -> (aa|aaa)
    re = quantify_chars(re)

    return re
}

# Test code for processing sample regex from stdin or file argument
{ print $0 " -> " qre($0) }
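To see what such an expansion buys you, the following sketch skips the function and hands awk the expanded form of ^[0-9]{2,3}$ directly. The pattern is written out by hand following the [0-9]{2,3} rewrite noted in the comments of qre(), and the input lines are made up. Because it uses only grouping and alternation, it should behave the same in awk implementations with or without interval support.

```shell
#!/bin/sh
# Hand-expanded equivalent of /^[0-9]{2,3}$/ :
# "two digits, or three digits" spelled out with alternation.
out=$(printf '1\n12\n123\n1234\n' |
    awk '/^([0-9][0-9]|[0-9][0-9][0-9])$/')
printf '%s\n' "$out"
```

Only the two-digit and three-digit lines, 12 and 123, are printed.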

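Although \< and \> are not portable across awk implementations (gawk accepts them as an extension, most others do not), they can be implemented with a function. Note that the replacement expressions match the neighboring non-word character itself rather than the empty position next to it, so that character becomes part of the matched text.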

#!/usr/bin/awk -f
function wre(re,    head, repl, tail, rstr)
{
    tail = re
    while (match(tail, "\\\\[<>]"))
    {
        head = substr(tail, 1, RSTART - 1) # text before match
        repl = substr(tail, RSTART, RLENGTH) # match to replace
        tail = substr(tail, RSTART + RLENGTH) # text after match
        if ((match(head, /\\+$/) ? RLENGTH + 1 : 1) % 2 == 1)
            repl = substr(repl, 2, 1) == "<" ? \
                "(^|[^_[:alnum:]])" : "([^_[:alnum:]]|$)"
        rstr = rstr head repl
    }
    return rstr tail
}

# Test code for processing sample regex from stdin or file argument
{ print $0 " -> " wre($0) }
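As a quick check of the idea, this sketch applies the two replacement expressions used by wre() by hand, giving the portable equivalent of \<cat\>. The input lines are made up, and the example assumes an awk that understands POSIX character classes such as [:alnum:].

```shell
#!/bin/sh
# Portable stand-in for /\<cat\>/ : "cat" must be preceded by the start
# of the line or a non-word character, and followed by a non-word
# character or the end of the line.
out=$(printf 'cat\nconcatenate\na cat sat\ncats\n' |
    awk '/(^|[^_[:alnum:]])cat([^_[:alnum:]]|$)/')
printf '%s\n' "$out"
```

The lines cat and a cat sat are printed; concatenate and cats are not, because cat is only part of a longer word there.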