awk

The awk language is powerful enough that entire books have been written about it alone. This section cannot cover all of it, but no discussion of serious shell programming is complete without awk.

Named after its principal authors, Al Aho, Peter Weinberger, and Brian Kernighan, awk offers a more flexible way than grep to combine regular expressions with custom actions.

grep, as its name implies (Global Regular Expression Print), is good at printing either a whole line or the matching portion of a line when one or more regular expressions match. What it does not do so well is:

  • Print an unmatched portion of a line

  • Perform an action other than print

And without piping one invocation of grep into another, you cannot:

  • AND two or more regular expressions (only OR)

  • Include lines matching one regular expression while excluding lines matching another

awk, on the other hand, makes those things easy.
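As a brief illustration of the first two items, the sketch below matches on one part of a line while printing a different part, and then performs an action other than printing (keeping a running total). The sample data and the function names are made up for this example.

```shell
#!/bin/sh
# Hypothetical sample data: a name and a count on each line
sample() { printf 'apple 3\nbanana 5\napricot 4\n'; }

# Print an unmatched portion: match on the name, print only the count
counts_for_a() { sample | awk '/^a/ { print $2 }'; }

# Perform an action other than print: total the counts of matching lines
total_for_a() { sample | awk '/^a/ { sum += $2 } END { print sum }'; }

counts_for_a
total_for_a
```

Both functions read the stream once; the second prints nothing until the END rule runs after the last line.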

Commonly, when you need to find a line that matches more than one regular expression, you will see:

#!/bin/sh
echo abc | grep a | grep b | grep c

This produces the expected output of abc, but it is inefficient to send the line through three separate invocations of grep when a single invocation of awk achieves the same result:

#!/bin/sh
echo abc | awk '/a/ && /b/ && /c/'

An awk program is a series of rules of the form PREDICATE { ACTION }. If { ACTION } is omitted and PREDICATE evaluates to true (non-zero), the line is printed. This makes it easy to translate most grep commands into awk. In the example above, awk is told to print the line (the default { ACTION }) when the line contains at least one a, one b, and one c.
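The following sketch shows the forms a rule can take. The input and the function names are invented for illustration; the last function relies on the fact that any expression, not only a regular expression, can serve as the predicate.

```shell
#!/bin/sh
# Hypothetical three-line input used by every example below
input() { printf 'one\ntwo\nthree\n'; }

# Predicate only: the default action prints each matching line
predicate_only() { input | awk '/t/'; }

# Predicate and action: print the line number of each matching line
predicate_and_action() { input | awk '/t/ { print NR }'; }

# Action only: a missing predicate selects every line
action_only() { input | awk '{ print length($0) }'; }

# Arithmetic predicate: non-zero is true, so odd-numbered lines print
expression_predicate() { input | awk 'NR % 2'; }

predicate_only
predicate_and_action
action_only
expression_predicate
```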

Commonly, when you need to find a line that matches one regular expression but excludes another, you will see:

#!/bin/sh
printf 'doghouse\nbirdhouse\n' | grep house | grep -v dog

This produces the expected output of birdhouse, but it is inefficient because the stream passes through two grep processes when a single invocation of awk can process it once to produce the same result:

#!/bin/sh
printf 'doghouse\nbirdhouse\n' | awk '/house/ && !/dog/'
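The same operators combine freely, so a filter that would otherwise need several grep processes can be written as one predicate. This is a minimal sketch with invented sample words and an invented function name:

```shell
#!/bin/sh
# Keep lines containing "house" OR "cage", but NOT "dog" and NOT "tree".
# The word list is hypothetical sample data.
combine() {
    printf 'doghouse\nbirdhouse\nbirdcage\ntreehouse\n' |
        awk '(/house/ || /cage/) && !/dog/ && !/tree/'
}

combine
```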

This table will help you translate grep regular expression syntax to awk regex syntax:

Element              Portable grep     Extended grep    awk
Grouping             \( and \)         ( and )          ( and )
Quantity 1 or more   \+ or \{1,\}      + or {1,}        +
Quantity 0 or 1      \? or \{0,1\}     ? or {0,1}       ?
Quantity N           \{N\}             {N}              Not portable *
Quantity N or less   \{,N\}            {,N}             Not portable *
OR                   \|                |                |
Word bounding        \< and \>         \< and \>        Unsupported *

* Portable awk solution offered below.

The syntax of regular expressions in awk most closely resembles that of egrep (or grep -E), except that numeric quantifiers cannot be relied upon across awk implementations. Only the basic *, +, and ? (for "0 or more", "1 or more", and "0 or 1" respectively) are safe to use everywhere.
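A short sketch of the egrep-style syntax that awk accepts, using only the portable operators. The word list and the function names are made up for this example.

```shell
#!/bin/sh
# Hypothetical word list
words() { printf 'color\ncolour\ncolouur\ngray\ngrey\ngreen\n'; }

# ? means "0 or 1" of the preceding item
zero_or_one() { words | awk '/^colou?r$/'; }

# + means "1 or more" of the preceding item
one_or_more() { words | awk '/^colou+r$/'; }

# Grouping with ( ) and alternation with |, unescaped as in grep -E
grouped() { words | awk '/^gr(a|e)y$/'; }

zero_or_one
one_or_more
grouped
```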

Although {N}, {,N}, {N,}, and {N,M} are unsupported in many flavors of awk, they can be emulated with a function that rewrites each quantifier into an equivalent expression built from grouping, alternation, + and ?.

#!/usr/bin/awk -f
BEGIN {
    d = "[[:digit:]]+"
    quantifier = sprintf("\\{(%s|%s,|,%s|%s,%s)\\}", d, d, d, d, d)
}

# Expand item repeated per qstr ("N", "N,", ",M" or "N,M") into plain ERE
function quantify(item, qstr,        init, stop, curitem, itemlist, i)
{
    init = stop = 0
    # Adding 0 forces numeric values; substr() returns strings, which
    # would otherwise be compared as text in the loops below
    if (qstr ~ "^" d "$") init = stop = qstr + 0
    else
    {
        if (match(qstr, "^" d ",")) init = substr(qstr, 1, RLENGTH - 1) + 0
        if (match(qstr, "," d "$")) stop = substr(qstr, RSTART + 1) + 0
    }
    curitem = itemlist = ""
    for (i = 1; i <= init; i++) curitem = curitem item
    if (!stop) itemlist = curitem itemlist "(" item ")+?"
    else for (i = init; i <= stop; i++)
    {
        itemlist = itemlist (itemlist ? "|" : "") curitem
        curitem = curitem item
    }
    return "(" itemlist ")" (init ? "" : "?")
}

# Expand quantifiers that follow a single character, scanning left to
# right so that every quantifier in re is visited
function quantify_chars(re,        done, head, char, item, tail, qstr, escaped)
{
    done = ""
    while (match(re, "." quantifier))
    {
        head = substr(re, 1, RSTART - 1) # text before leading char
        char = substr(re, RSTART, 1) # matched character
        item = substr(re, RSTART, RLENGTH) # character plus quantifier
        tail = substr(re, RSTART + RLENGTH) # text after quantifier
        qstr = substr(re, RSTART + 2, RLENGTH - 3) # braces pruned

        escaped = 0
        if (char == "\\")
        {
            #
            # Odd backslash[es]: `\{2}' not expanded
            # Even backslashes: `\\{2}' expanded
            #
            escaped = (match(head, /\\+$/) ? RLENGTH + 1 : 1) % 2
            if (!escaped)
            {
                head = substr(head, 1, length(head) - 1)
                char = char char
            }
        }
        else if (char == ")" || char == "]")
        {
            #
            # Odd backslash[es]: `\){2}' expanded
            # Even backslashes: `\\){2}' not expanded
            #
            escaped = (match(head, /\\+$/) ? RLENGTH + 1 : 1) % 2
            if (!escaped)
            {
                head = substr(head, 1, length(head) - 1)
                char = "\\" char
            }
        }

        # Emit the processed prefix, then continue with the remainder
        done = done head (escaped ? item : quantify(char, qstr))
        re = tail
    }
    return done re
}

# Expand quantifiers that follow a span delimited by open_char and
# close_char, working from the rightmost quantifier to the leftmost
function quantify_spans(re, open_char, close_char,
        greedy, ungreedy, head, item, tail, qstr, escaped,
        n, subhead, subitem, subtail, rlen)
{
    greedy = sprintf("\\%c.*\\%c", open_char, close_char)
    ungreedy = sprintf("[^\\%c\\%c]*[\\%c\\%c]$",
        open_char, close_char, open_char, close_char)
    head = re
    while (match(head, greedy quantifier))
    {
        head = substr(re, 1, RSTART - 1) # text before open_char
        item = substr(re, RSTART, RLENGTH) # match with quantifier
        tail = substr(re, RSTART + RLENGTH) # text after quantifier
        match(item, quantifier "$") # determine quantifier length
        qstr = substr(item, RSTART + 1, RLENGTH - 2) # braces pruned
        item = substr(item, 1, RSTART - 1) # prune quantifier

        #
        # Odd backslash[es]: `[abc\]{2}' not expanded
        # Even backslashes: `[abc\\]{2}' expanded
        #
        escaped = 0
        if (match(item, "\\\\+\\" close_char "$"))
            escaped = (RLENGTH - 1) % 2
        if (escaped) continue

        #
        # Fixup greedy matches: `(abc)(123){2}' -> `(123){2}'
        # Handle unbalanced matches: `(abc)){2}' not expanded
        #
        n = 0
        subhead = item
        subtail = subitem = ""
        while (match(subhead, ungreedy))
        {
            subhead = substr(item, 1, RSTART - 1)
            subitem = substr(item, RSTART, rlen = RLENGTH)

            #
            # Odd backslash[es] (`\]'): ignore/skip character
            # Even backslashes (`\\]'): go on to increment level
            #
            escaped = 0
            if (match(subitem, sprintf("\\\\+(\\%s|\\%s)$",
                open_char, close_char)))
                escaped = (RLENGTH - 1) % 2
            if (escaped)
            {
                subtail = subitem subtail
                continue
            }

            #
            # We are processing right-to-left, so close_char at end
            # of subitem means increment level and open_char means
            # to decrement level
            #
            if (subitem ~ "\\" close_char "$")
            {
                n++
                subtail = subitem subtail
            }
            else if (subitem ~ "\\" open_char "$")
            {
                if (--n) subtail = subitem subtail
                else
                {
                    subhead = subhead \
                        substr(subitem, 1, rlen - 1)
                    subtail = open_char subtail
                }
            }
            if (!n) break
        }
        if (n) continue # open/close characters are unbalanced

        head = head subhead
        item = subtail
        if (item ~ /^\(.*\)$/) # prune parentheses
            item = substr(item, 2, length(item) - 2)
        re = head quantify(item, qstr) tail
    }
    return re
}

function qre(re)
{
    # (abc){2} -> (abcabc)
    # (abc){,2} -> (abc|abcabc)?
    # (abc){2,} -> (abcabc(abc)+?)
    # (abc){2,3} -> (abcabc|abcabcabc)
    re = quantify_spans(re, "(", ")")
        # NB: Do this first to eliminate extra work since later
        # expansions below may introduce additional parentheses

    # [0-9]{2} -> ([0-9][0-9])
    # [0-9]{,2} -> ([0-9]|[0-9][0-9])?
    # [0-9]{2,} -> ([0-9][0-9]([0-9])+?)
    # [0-9]{2,3} -> ([0-9][0-9]|[0-9][0-9][0-9])
    re = quantify_spans(re, "[", "]")

    # a{2} -> (aa)
    # a{,2} -> (a|aa)?
    # a{2,} -> (aa(a)+?)
    # a{2,3} -> (aa|aaa)
    re = quantify_chars(re)

    return re
}

# Test code for processing sample regex from stdin or file argument
{ print $0 " -> " qre($0) }
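To show what such a rewrite buys you, the sketch below uses hand-expanded patterns of the same shape as the expansions listed in the qre comments. These are written out by hand rather than generated by the function, and the sample numbers and function names are invented for this example.

```shell
#!/bin/sh
# Hypothetical sample input: one number per line
nums() { printf '7\n42\n123\n4567\n'; }

# Equivalent of ^[0-9]{2}$ : exactly two digits
exactly_two() { nums | awk '/^([0-9][0-9])$/'; }

# Equivalent of ^[0-9]{,2}$ : at most two digits (also matches an empty line)
at_most_two() { nums | awk '/^([0-9]|[0-9][0-9])?$/'; }

# Equivalent of ^[0-9]{2,3}$ : two or three digits
two_or_three() { nums | awk '/^([0-9][0-9]|[0-9][0-9][0-9])$/'; }

exactly_two
at_most_two
two_or_three
```

Because the expanded patterns use only grouping, alternation, and ?, they do not depend on interval support in the awk being used.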

Although \< and \> are not part of the regular expression syntax that POSIX specifies for awk (GNU awk accepts them as an extension), they can be emulated with a function.

#!/usr/bin/awk -f
function wre(re,        head, repl, tail, rstr)
{
    tail = re
    while (match(tail, "\\\\[<>]"))
    {
        head = substr(tail, 1, RSTART - 1) # text before match
        repl = substr(tail, RSTART, RLENGTH) # match to replace
        tail = substr(tail, RSTART + RLENGTH) # text after match
        if ((match(head, /\\+$/) ? RLENGTH + 1 : 1) % 2 == 1)
            repl = substr(repl, 2, 1) == "<" ? \
                "(^|[^_[:alnum:]])" : "([^_[:alnum:]]|$)"
        rstr = rstr head repl
    }
    return rstr tail
}

# Test code for processing sample regex from stdin or file argument
{ print $0 " -> " wre($0) }
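The replacement patterns can also be written directly in a predicate. The sketch below is the hand-written form of \<cat\>, with an invented word list and function name. Unlike a true word boundary, this emulation consumes the neighboring non-word character, which is harmless when you only test whether a line matches.

```shell
#!/bin/sh
# Hypothetical sample lines
lines() { printf 'cat\nconcatenate\nmy cat sat\ncat_food\nwildcat\n'; }

# "cat" as a whole word: preceded by start of line or a non-word
# character, and followed by a non-word character or end of line
whole_word_cat() {
    lines | awk '/(^|[^_[:alnum:]])cat([^_[:alnum:]]|$)/'
}

whole_word_cat
```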
