Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes Monday through Friday.


HPR2184: Gnu Awk - Part 5

Hosted by Mr. Young on 2016-12-15 00:00:00
Download or Listen

GNU AWK - Part 5

Regular Expressions in AWK

The syntax for using regular expressions to match lines in AWK is as follows:

word ~ /match/

Or for not matching, use the following:

word !~ /match/

Remember the following file from the previous episodes:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

We can run the following command:

$1 ~ /p[elu]/ {print $0}

We will get the following output:

apple      red    4
grape      purple 10
apple      green  8
plum       purple 2
pineapple  yellow 5

In another example:

$2 ~ /e{2}/ {print $0}

Will produce the output:

apple      green  8

Regular expression basics

Certain characters have special meaning when using regular expressions.

Anchors

  • ^ - beginning of the line
  • $ - end of the line
  • \A - beginning of a string
  • \z - end of a string
  • \b on a word boundary

Characters

  • [ad] - a or d
  • [a-d] - any character a through d
  • [^a-d] - not any character a through d
  • \w - any word
  • \s - any white-space character
  • \d - any digit

The capital version of w, s, and d are negations.

Or, you can reference characters the POSIX standard way:

  • [:alnum:] - Alphanumeric characters
  • [:alpha:] - Alphabetic characters
  • [:blank:] - Space and TAB characters
  • [:cntrl:] - Control characters
  • [:digit:] - Numeric characters
  • [:graph:] - Characters that are both printable and visible (a space is printable but not visible, whereas an ‘a’ is both)
  • [:lower:] - Lowercase alphabetic characters
  • [:print:] - Printable characters (characters that are not control characters)
  • [:punct:] - Punctuation characters (characters that are not letters, digits, control characters, or space characters)
  • [:space:] - Space characters (such as space, TAB, and formfeed, to name a few)
  • [:upper:] - Uppercase alphabetic characters
  • [:xdigit:] - Characters that are hexadecimal digits

Quantifiers

  • . - match any character
  • + - match preceding one or more times
  • * - match preceding zero or more times
  • ? - match preceding zero or one time
  • {n} - match preceding exactly n times
  • {n,} - match preceding n or more times
  • {n,m} - match preceding between n and m times

Grouped Matches

  • (...) - Parentheses are used for grouping
  • | - Means or in the context of a grouped match

Replacement

  • The sub command substitutes the match with the replacement string. This only applies to the first match.
  • The gsub command substitutes all matching items.
  • The gensub command command substitutes the in a similar way as sub and gsub, but with extra functionality
  • The & character in the replacement field references the matched text. You have to use \& to replace the match with the literal & character.

Example:

{ sub(/apple/, "nut", $1);
    print $1}

The output is:

name
nut
banana
strawberry
grape
nut
plum
kiwi
potato
pinenut

Another example:

{ sub(/.+(pp|rr)/, "test-&", $1);
    print $1}

This produces the following output:

name
test-apple
banana
test-strawberry
grape
test-apple
plum
kiwi
potato
test-pineapple

Resources

Comments



More Information...


Copyright Information

Unless otherwise stated, our shows are released under a Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.

The HPR Website Design is released to the Public Domain.