Linux Regex: Master grep, sed, awk Patterns

Regular expressions turn up across Linux tooling far more often than you might expect, which makes them feel less like a niche text-processing feature and more like a shared language for pattern matching.

Let’s see grep in action. Imagine you have a log file, app.log, and you want to find all lines that contain either "ERROR" or "WARNING", but not "DEBUG".

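grep -P '^(?=.*(ERROR|WARNING))(?!.*DEBUG).*$' app.log

Here’s what’s happening:

  • -P tells GNU grep to use Perl-compatible regular expressions (PCRE), which support lookarounds such as (?=...) and (?!...). Extended regular expressions (-E) give us | for OR but do not support lookarounds, and -P is not available in every grep build.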
  • ^ anchors the match to the start of the line.
  • (?=.*(ERROR|WARNING)) is a positive lookahead. It asserts that somewhere on the line (.*) there’s either "ERROR" or "WARNING". Crucially, it doesn’t consume any characters, so the regex engine can continue from the start of the line.
  • (?!.*DEBUG) is a negative lookahead. It asserts that nowhere on the line (.*) is there "DEBUG". Again, it doesn’t consume characters.
  • .* then matches the entire line if both lookaheads are successful.
  • $ anchors the match to the end of the line.

This grep command effectively filters your logs, showing you only the critical messages without the noise.
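Lookaround support depends on the regex engine your grep was built with, so it can help to have a version that avoids lookarounds entirely. The sketch below chains two grep calls: the first keeps lines with "ERROR" or "WARNING", the second drops any that mention "DEBUG". The sample log lines are hypothetical, since the real format of app.log is not shown here.

```shell
# Hypothetical sample log; stands in for app.log.
log=$(mktemp)
cat > "$log" <<'EOF'
INFO service started
ERROR disk full
WARNING DEBUG flag left enabled
DEBUG cache miss
WARNING memory high
EOF

# Step 1: keep lines containing ERROR or WARNING (-E enables | alternation).
# Step 2: -v inverts the match, dropping lines that contain DEBUG.
filtered=$(grep -E 'ERROR|WARNING' "$log" | grep -v 'DEBUG')
printf '%s\n' "$filtered"

rm -f "$log"
```

Two simple patterns are often easier to read and debug than one pattern with nested lookaheads, at the cost of running a second process.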

Now, let’s consider sed. sed (stream editor) is fantastic for making substitutions. Suppose you have a configuration file, config.ini, and you want to change all occurrences of port=8080 to port=8000, but only if the line doesn’t start with a # (meaning it’s not a comment).

sed '/^#/! s/port=8080/port=8000/' config.ini

Let’s break this down:

  • sed operates line by line.
  • /^#/ is an address. It matches lines that start with #.
  • ! negates the address. So, /^#/! means "for lines that do not start with #".
  • s/port=8080/port=8000/ is the substitution command. It finds the literal string port=8080 and replaces the first occurrence on each line with port=8000. Add the g flag (s/port=8080/port=8000/g) if a line can contain more than one occurrence.
  • The substitution only happens on lines that passed the /^#/! check.

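This ensures only active configuration settings are changed, not commented-out ones. Note that sed writes the result to standard output; config.ini itself stays unchanged unless you redirect the output to a new file or use an in-place option such as GNU sed’s -i.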
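To see the address negation at work, here is a minimal sketch that runs the same substitution against a throwaway file. The file contents are made up for illustration and stand in for config.ini.

```shell
# Hypothetical sample config; stands in for config.ini.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# port=8080
port=8080
host=localhost
EOF

# Apply the substitution only to lines that do NOT start with '#'.
# The result goes to stdout; the file on disk is not modified.
edited=$(sed '/^#/!s/port=8080/port=8000/' "$conf")
printf '%s\n' "$edited"

rm -f "$conf"
```

The commented line keeps port=8080 while the active line becomes port=8000. Previewing the output like this before switching to in-place editing is a cheap way to catch a pattern that matches more than intended.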

awk is the powerhouse for structured data. Imagine you have a CSV file, data.csv, with columns for ID, Name, and Value. You want to print the Name and Value for all rows where Value is greater than 100.

awk -F',' '$3 > 100 { print $2, $3 }' data.csv

Here’s the awk magic:

  • -F',' tells awk that the input fields are separated by commas.
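  • $3 > 100 is the condition. $3 refers to the third field (which is Value in our example). If the value in the third field is numerically greater than 100, the condition is true. If data.csv has a header row, its third field is not a number, so awk falls back to a string comparison and may print the header too; putting NR > 1 && in front of the condition skips the first line.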
  • { print $2, $3 } is the action. If the condition is true, awk prints the second field (Name) and the third field (Value), separated by a space (the default output field separator).

This awk command allows you to selectively extract and display data based on numerical comparisons, which is incredibly useful for analyzing tabular data.
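As a quick check of the filter, the sketch below runs it on a small made-up CSV. It assumes the file starts with a header row (the article does not say whether data.csv has one), so the condition includes NR > 1 to skip it.

```shell
# Hypothetical sample data; stands in for data.csv.
csv=$(mktemp)
cat > "$csv" <<'EOF'
ID,Name,Value
1,alpha,50
2,beta,150
3,gamma,200
EOF

# NR > 1 skips the header line (an assumption about the file layout).
# $3 > 100 is a numeric comparison for the data rows.
rows=$(awk -F',' 'NR > 1 && $3 > 100 { print $2, $3 }' "$csv")
printf '%s\n' "$rows"

rm -f "$csv"
```

Only the beta and gamma rows pass the filter. Keep in mind that splitting on a bare comma does not handle quoted fields that contain commas, so this approach suits simple CSV files only.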

A common pitfall is misunderstanding how awk decides between string and numeric comparison. Fields read from input are strings, but a field that looks like a number is compared numerically when the other operand is also numeric; otherwise awk compares the two as strings. This automatic coercion is usually a convenience, but it can surprise you. For instance, a field containing 007 equals 7 in a numeric comparison, which is not what you want if you meant to compare the text exactly. Concatenating an empty string, as in ($1 ""), forces a string comparison.
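The difference is easy to demonstrate with a one-line input. This sketch compares the same field twice, once as a number and once as a string; the variable names are only for illustration.

```shell
# The input field "007" looks numeric, so comparing it with the number 7
# is a numeric comparison: 7 == 7.
numeric=$(echo '007' | awk '{ r = ($1 == 7) ? "equal" : "different"; print r }')

# Concatenating "" turns the field into a plain string, so the comparison
# with "7" is done character by character: "007" vs "7".
string=$(echo '007' | awk '{ r = (($1 "") == "7") ? "equal" : "different"; print r }')

printf 'numeric: %s\nstring: %s\n' "$numeric" "$string"
```

The numeric comparison reports "equal" and the string comparison reports "different". Going the other way, adding 0 to a value (as in $1 + 0) is the usual idiom for forcing a numeric interpretation.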

The next step in mastering these tools is understanding how to combine them for more complex workflows, like piping the output of grep into sed or awk for further processing.
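A minimal sketch of such a pipeline, assuming a hypothetical log layout of "LEVEL component message": grep selects the lines of interest and awk pulls out one field from each.

```shell
# Hypothetical sample log with the assumed "LEVEL component message" layout.
log=$(mktemp)
cat > "$log" <<'EOF'
INFO api started
ERROR db timeout
WARNING api slow
ERROR cache unreachable
EOF

# grep keeps only ERROR lines; awk prints the second whitespace-separated field.
components=$(grep 'ERROR' "$log" | awk '{ print $2 }')
printf '%s\n' "$components"

rm -f "$log"
```

This prints the component of each ERROR line (db, then cache). awk can also do the matching itself with awk '/ERROR/ { print $2 }', but splitting the work across tools keeps each step easy to test on its own.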
