# awk

The `awk` language is powerful enough that entire books have been written about this utility alone. Even so, you cannot talk about serious shell programming without talking about `awk`.

Named after its principal authors, Al Aho, Peter Weinberger, and Brian Kernighan, `awk` gives you a more flexible way than `grep` to combine regular expressions with custom actions.

`grep`, as its name implies (Global Regular Expression Print), is good at printing either a line or the matching portion of a line when one or more regular expressions match. What it doesn't do so well is:

* Print an unmatched portion of a line
* Perform an action other than print

And without piping one invocation of `grep` into another, you cannot:

* AND two or more regular expressions (only OR)
* Include lines matching one regular expression while excluding lines matching another

`awk` on the other hand makes those things easy.
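As a minimal sketch of the first two points (the sample names and numbers are made up for illustration), the commands below first print a part of the line that the regular expression did not match, and then perform an action other than printing the line:

```sh
#!/bin/sh
# Sample data is hypothetical: a name and an age on each line.
data='alice 30
bob 25
carol 41'

# Print an unmatched portion: the regex matches the name,
# but only the second field (the age) is printed.
age=$(printf '%s\n' "$data" | awk '/^bob/ { print $2 }')
echo "$age"

# Perform an action other than print: add up the ages on lines
# containing an "a" and report the total once input is exhausted.
total=$(printf '%s\n' "$data" | awk '/a/ { sum += $2 } END { print sum }')
echo "$total"
```

Both results come from a single pass over the input, with no second process in the pipeline.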

Commonly when you need to find a line that matches more than one regular expression, you will see:

```
#!/bin/sh
echo abc | grep a | grep b | grep c
```

This produces the expected output of `abc`, but it is inefficient to send the line through three separate invocations of `grep` when a single invocation of `awk` achieves the same result:

```
#!/bin/sh
echo abc | awk '/a/ && /b/ && /c/'
```

The syntax of the `awk` language is `PREDICATE { ACTION }` and if `{ ACTION }` is missing and `PREDICATE` evaluates to true (non-zero), the line is printed. This makes it easy to translate most `grep` commands into `awk`. In the above example, `awk` is told to print the line (the default `{ ACTION }`) when the line contains at least one `a`, one `b`, and one `c`.
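The following sketch shows the three shapes a rule can take. The input data is hypothetical, and the predicates are not limited to regular expressions: any expression, such as a comparison on a field, can be used.

```sh
#!/bin/sh
# Hypothetical sample input: a name and a count on each line.
data='apple 3
banana 0
cherry 7'

# PREDICATE only: the default action prints the whole line.
pred_only=$(printf '%s\n' "$data" | awk '$2 > 0')

# { ACTION } only: a missing predicate is true for every line.
action_only=$(printf '%s\n' "$data" | awk '{ print $1 }')

# PREDICATE { ACTION }: the action runs only where the predicate is true.
both=$(printf '%s\n' "$data" | awk '/an/ { print NR ": " $1 }')

printf '%s\n' "$pred_only" "$action_only" "$both"
```

Here `NR` is the built-in record (line) number, so the last command reports where the match was found as well as what matched.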

Commonly when you need to find a line that matches one regular expression but excludes another, you will see:

```
#!/bin/sh
printf 'doghouse\nbirdhouse\n' | grep house | grep -v dog
```

This produces the expected output of `birdhouse` but is inefficient because it sends both lines to each `grep` when a single invocation of `awk` can process the stream once to produce the same results:

```
#!/bin/sh
printf 'doghouse\nbirdhouse\n' | awk '/house/ && !/dog/'
```

This table will help you translate `grep` regular expression syntax to `awk` regex syntax:

| Element            | Portable grep     | Extended grep  | awk             |
| ------------------ | ----------------- | -------------- | --------------- |
| Grouping           | `\(` and `\)`     | `(` and `)`    | `(` and `)`     |
| Quantity 1 or more | `\+` or `\{1,\}`  | `+` or `{1,}`  | `+`             |
| Quantity 0 or 1    | `\?` or `\{0,1\}` | `?` or `{0,1}` | `?`             |
| Quantity N         | `\{N\}`           | `{N}`          | Not portable \* |
| Quantity N or less | `\{,N\}`          | `{,N}`         | Not portable \* |
| OR                 | `\∣`              | `∣`            | `∣`             |
| Word Bounding      | `\<` and `\>`     | `\<` and `\>`  | Not portable \* |

\* Portable `awk` solution offered below.

The syntax of regular expressions in `awk` most closely resembles that of `egrep` (or `grep -E`), except that numeric quantifiers are not portably supported beyond the basic `?` and `+` for quantities "0 or 1" and "1 or more" respectively.
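As an illustration, an extended `grep` pattern that sticks to grouping, OR, `?`, and `+` can usually be moved into `awk` unchanged. The word list below is hypothetical:

```sh
#!/bin/sh
# Hypothetical word list used only for illustration.
words='cat
cats
dog
doghouse
bird'

# Extended grep: grouping, OR, and the "0 or 1" quantifier.
with_grep=$(printf '%s\n' "$words" | grep -E '^(cat|dog)s?$')

# The same regular expression used as an awk predicate.
with_awk=$(printf '%s\n' "$words" | awk '/^(cat|dog)s?$/')

printf '%s\n' "$with_awk"
```

Both commands select `cat`, `cats`, and `dog`, and reject `doghouse` and `bird`.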

Although `{N}`, `{,N}`, `{N,}`, and `{N,M}` are unsupported in many flavors of `awk`, they can be implemented with functions that expand them into portable syntax.

```
#!/usr/bin/awk -f
BEGIN {
    d = "[[:digit:]]+"
    quantifier = sprintf("\\{(%s|%s,|,%s|%s,%s)\\}", d, d, d, d, d)
}

function quantify(item, qstr,        init, stop, curitem, itemlist, i)
{
    init = stop = 0
    # NB: `+ 0' forces numeric (not string) comparisons in the loops below
    if (qstr ~ "^" d "$") init = stop = qstr + 0
    else
    {
        if (match(qstr, "^" d ",")) init = substr(qstr, 1, RLENGTH - 1) + 0
        if (match(qstr, "," d "$")) stop = substr(qstr, RSTART + 1) + 0
    }
    curitem = itemlist = ""
    for (i = 1; i <= init; i++) curitem = curitem item
    if (!stop) itemlist = curitem itemlist "(" item ")+?"
    else for (i = init; i <= stop; i++)
    {
        itemlist = itemlist (itemlist ? "|" : "") curitem
        curitem = curitem item
    }
    return "(" itemlist ")" (init ? "" : "?")
}

function quantify_chars(re,        done, head, char, tail, qstr, escaped)
{
    done = "" # text already processed, left-to-right
    while (match(re, "." quantifier))
    {
        head = substr(re, 1, RSTART - 1) # text before leading char
        char = substr(re, RSTART, 1) # matched character
        tail = substr(re, RSTART + RLENGTH) # text after quantifier
        qstr = substr(re, RSTART + 2, RLENGTH - 3) # braces pruned

        escaped = 0
        if (char == "\\")
        {
            #
            # Odd backslash[es]: `\{2}' not expanded
            # Even backslashes: `\\{2}' expanded
            #
            escaped = (match(head, /\\+$/) ? RLENGTH + 1 : 1) % 2
            if (!escaped)
            {
                head = substr(head, 1, length(head) - 1)
                char = char char
            }
        }
        else if (char == ")" || char == "]")
        {
            #
            # Odd backslash[es]: `\){2}' expanded
            # Even backslashes: `\\){2}' not expanded
            #
            escaped = (match(head, /\\+$/) ? RLENGTH + 1 : 1) % 2
            if (!escaped)
            {
                head = substr(head, 1, length(head) - 1)
                char = "\\" char
            }
        }

        if (escaped)
        {
            # Keep the text as-is and rescan after the leading char
            done = done head char
            re = "{" qstr "}" tail
        }
        else
        {
            # Emit the expansion and scan what follows the quantifier
            done = done head quantify(char, qstr)
            re = tail
        }
    }
    return done re
}

function quantify_spans(re, open_char, close_char,
        greedy, ungreedy, head, item, tail, qstr, escaped,
        n, rlen, subhead, subitem, subtail)
{
    greedy = sprintf("\\%c.*\\%c", open_char, close_char)
    ungreedy = sprintf("[^\\%c\\%c]*[\\%c\\%c]$",
        open_char, close_char, open_char, close_char)
    head = re
    while (match(head, greedy quantifier))
    {
        head = substr(re, 1, RSTART - 1) # text before open_char
        item = substr(re, RSTART, RLENGTH) # match with quantifier
        tail = substr(re, RSTART + RLENGTH) # text after quantifier
        match(item, quantifier "$") # determine quantifier length
        qstr = substr(item, RSTART + 1, RLENGTH - 2) # braces pruned
        item = substr(item, 1, RSTART - 1) # prune quantifier

        #
        # Odd backslash[es]: `[abc\]{2}' not expanded
        # Even backslashes: `[abc\\]{2}' expanded
        #
        escaped = 0
        if (match(item, "\\\\+\\" close_char "$"))
            escaped = (RLENGTH - 1) % 2
        if (escaped) continue

        #
        # Fixup greedy matches: `(abc)(123){2}' -> `(123){2}'
        # Handle unbalanced matches: `(abc)){2}' not expanded
        #
        n = 0
        subhead = item
        subtail = subitem = ""
        while (match(subhead, ungreedy))
        {
            rlen = RLENGTH # saved; match() below overwrites RLENGTH
            subhead = substr(item, 1, RSTART - 1)
            subitem = substr(item, RSTART, rlen)

            #
            # Odd backslash[es] (`\]'): ignore/skip character
            # Even backslashes (`\\]'): go on to increment level
            #
            escaped = 0
            if (match(subitem, sprintf("\\\\+(\\%s|\\%s)$",
                open_char, close_char)))
                escaped = (RLENGTH - 1) % 2
            if (escaped)
            {
                subtail = subitem subtail
                continue
            }

            #
            # We are processing right-to-left, so close_char at end
            # of subitem means increment level and open_char means
            # to decrement level
            #
            if (subitem ~ "\\" close_char "$")
            {
                n++
                subtail = subitem subtail
            }
            else if (subitem ~ "\\" open_char "$")
            {
                if (--n) subtail = subitem subtail
                else
                {
                    subhead = subhead \
                        substr(subitem, 1, rlen - 1)
                    subtail = open_char subtail
                }
            }
            if (!n) break
        }
        if (n) continue # open/close characters are unbalanced

        head = head subhead
        item = subtail
        if (item ~ /^\(.*\)$/) # prune parentheses
            item = substr(item, 2, length(item) - 2)
        re = head quantify(item, qstr) tail
    }
    return re
}

function qre(re)
{
    # (abc){2} -> (abcabc)
    # (abc){,2} -> (abc|abcabc)?
    # (abc){2,} -> (abcabc(abc)+?)
    # (abc){2,3} -> (abcabc|abcabcabc)
    re = quantify_spans(re, "(", ")")
        # NB: Do this first to eliminate extra work since later
        # expansions below may introduce additional parentheses

    # [0-9]{2} -> ([0-9][0-9])
    # [0-9]{,2} -> ([0-9]|[0-9][0-9])?
    # [0-9]{2,} -> ([0-9][0-9]([0-9])+?)
    # [0-9]{2,3} -> ([0-9][0-9]|[0-9][0-9][0-9])
    re = quantify_spans(re, "[", "]")

    # a{2} -> (aa)
    # a{,2} -> (a|aa)?
    # a{2,} -> (aa(a)+?)
    # a{2,3} -> (aa|aaa)
    re = quantify_chars(re)

    return re
}

# Test code for processing sample regex from stdin or file argument
{ print $0 " -> " qre($0) }
```
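When only an exact count is needed, the core idea of expanding a quantifier by repetition can be sketched in a few lines. The `rep()` helper and the sample input below are hypothetical and are not part of the script above; they only assume an `awk` that supports user-defined functions and dynamic regular expressions.

```sh
#!/bin/sh
# rep() is a hypothetical helper: it repeats a regex item n times
# so that no {N} quantifier is needed in the final expression.
matches=$(printf '%s\n' '2024' '99' '12345' | awk '
function rep(item, n,        i, out)
{
    out = ""
    for (i = 1; i <= n; i++) out = out item
    return out
}
BEGIN { year = "^" rep("[0-9]", 4) "$" } # same as ^[0-9]{4}$
$0 ~ year
')
echo "$matches"
```

Only `2024` is printed, because the anchored expression requires exactly four digits.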

Although `\<` and `\>` are not portable across flavors of `awk` (GNU `awk` accepts them as an extension), they can be implemented with a function.

```
#!/usr/bin/awk -f
function wre(re,        head, repl, tail, rstr)
{
    tail = re
    while (match(tail, "\\\\[<>]"))
    {
        head = substr(tail, 1, RSTART - 1) # text before match
        repl = substr(tail, RSTART, RLENGTH) # match to replace
        tail = substr(tail, RSTART + RLENGTH) # text after match
        if ((match(head, /\\+$/) ? RLENGTH + 1 : 1) % 2 == 1)
            repl = substr(repl, 2, 1) == "<" ? \
                "(^|[^_[:alnum:]])" : "([^_[:alnum:]]|$)"
        rstr = rstr head repl
    }
    return rstr tail
}

# Test code for processing sample regex from stdin or file argument
{ print $0 " -> " wre($0) }
```
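To see the effect of those replacements, the sketch below uses a hand-expanded equivalent of `\<cat\>` directly as a predicate. The sample lines are hypothetical, and the example assumes an `awk` that understands POSIX character classes such as `[:alnum:]`.

```sh
#!/bin/sh
# Hand-expanded equivalent of \<cat\>, using the same replacement
# text that wre() substitutes for \< and \>.
lines=$(printf '%s\n' 'cat' 'concatenate' 'the cat sat' 'cat_food' | awk '
/(^|[^_[:alnum:]])cat([^_[:alnum:]]|$)/
')
echo "$lines"
```

This selects `cat` and `the cat sat` while rejecting `concatenate` and `cat_food`. Note that, unlike a true word boundary, the expansion also consumes the neighboring non-word character, which matters if the matched text is later extracted or substituted.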
