Linux Regex: Master grep, sed, awk Patterns
The most surprising thing about regular expressions in Linux is how often they’re used for tasks you’d never expect, making them feel less like a text-processing tool and more like a universal language for pattern matching.
Let’s see grep in action. Imagine you have a log file, app.log, and you want to find all lines that contain either "ERROR" or "WARNING", but not "DEBUG".
grep -P '^(?=.*(ERROR|WARNING))(?!.*DEBUG).*$' app.log
Here’s what’s happening:
- The -P option enables Perl-compatible regular expressions (PCRE), which support lookarounds written as (?=...) and (?!...). It is a GNU grep feature; -E (extended regular expressions) gives us | for OR but does not support lookarounds.
- ^ anchors the match to the start of the line.
- (?=.*(ERROR|WARNING)) is a positive lookahead. It asserts that somewhere on the line (.*) there’s either "ERROR" or "WARNING". Crucially, it doesn’t consume any characters, so the regex engine can continue from the start of the line.
- (?!.*DEBUG) is a negative lookahead. It asserts that nowhere on the line (.*) is there "DEBUG". Again, it doesn’t consume characters.
- .* then matches the entire line if both lookaheads are successful.
- $ anchors the match to the end of the line.
This grep command effectively filters your logs, showing you only the critical messages without the noise.
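Lookarounds are a PCRE feature (grep -P in GNU grep) and are not part of POSIX extended regular expressions. Where PCRE support is unavailable, the same filter can be sketched as two plain grep stages. The sample log lines below are invented for illustration; the real format of app.log is not specified here.

```shell
# Hypothetical sample log, created in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/app.log" <<'EOF'
10:00:00 INFO service started
10:00:01 ERROR disk full
10:00:02 WARNING retrying (DEBUG trace attached)
10:00:03 DEBUG cache miss
10:00:04 WARNING slow response
EOF

# Stage 1 keeps lines containing ERROR or WARNING (ERE alternation).
# Stage 2 (-v) drops any of those lines that also contain DEBUG.
filtered=$(grep -E 'ERROR|WARNING' "$workdir/app.log" | grep -v 'DEBUG')
printf '%s\n' "$filtered"

rm -rf "$workdir"
```

Only the "ERROR disk full" and "WARNING slow response" lines survive both stages; the WARNING line that mentions DEBUG is dropped by the second grep.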
Now, let’s consider sed. sed (stream editor) is fantastic for making substitutions. Suppose you have a configuration file, config.ini, and you want to change all occurrences of port=8080 to port=8000, but only if the line doesn’t start with a # (meaning it’s not a comment).
sed '/^#/! s/port=8080/port=8000/' config.ini
Let’s break this down:
- sed operates line by line.
- /^#/ is an address. It matches lines that start with #.
- ! negates the address. So, /^#/! means "for lines that do not start with #".
- s/port=8080/port=8000/ is the substitution command. It finds the literal string port=8080 and replaces it with port=8000. Without the g flag it replaces only the first match on each line, which is enough when the setting appears once per line.
- The substitution only happens on lines that passed the /^#/! check.
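This ensures that only active configuration settings are changed, not commented-out ones. Note that sed writes the result to standard output and leaves config.ini itself untouched unless you redirect the output to a new file or use an in-place option such as -i.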
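To see the negated address at work on concrete input, here is a small sketch. The file contents are made up for illustration, and the g flag is used so that every match on a line is replaced rather than only the first.

```shell
# Hypothetical config file, created in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/config.ini" <<'EOF'
# port=8080 is the old default
port=8080
host=localhost
backup=port=8080,port=8080
EOF

# Substitute on every line that does NOT start with '#'.
result=$(sed '/^#/! s/port=8080/port=8000/g' "$workdir/config.ini")
printf '%s\n' "$result"

rm -rf "$workdir"
```

The comment line keeps its port=8080 text, while the two active lines are rewritten. A line that begins with whitespace followed by # would not match /^#/ and would still be edited; an address such as /^[[:space:]]*#/ covers that case if it matters for your files.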
awk is the powerhouse for structured data. Imagine you have a CSV file, data.csv, with columns for ID, Name, and Value. You want to print the Name and Value for all rows where Value is greater than 100.
awk -F',' '$3 > 100 { print $2, $3 }' data.csv
Here’s the awk magic:
- -F',' tells awk that the input fields are separated by commas.
- $3 > 100 is the condition. $3 refers to the third field (which is Value in our example). If the value in the third field is numerically greater than 100, the condition is true.
- { print $2, $3 } is the action. If the condition is true, awk prints the second field (Name) and the third field (Value), separated by a space (the default output field separator).
This awk command allows you to selectively extract and display data based on numerical comparisons, which is incredibly useful for analyzing tabular data.
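A worked sketch on a small, made-up data.csv follows. It assumes the file starts with a header row, which is not stated above; NR > 1 skips that row, because a non-numeric header such as Value is compared as a string and would otherwise pass the $3 > 100 test.

```shell
# Hypothetical CSV with an assumed header row.
workdir=$(mktemp -d)
cat > "$workdir/data.csv" <<'EOF'
ID,Name,Value
1,alpha,50
2,beta,150
3,gamma,100
4,delta,101
EOF

# NR > 1 skips the header; $3 > 100 is a numeric comparison because
# the third field of each data row looks like a number.
rows=$(awk -F',' 'NR > 1 && $3 > 100 { print $2, $3 }' "$workdir/data.csv")
printf '%s\n' "$rows"

rm -rf "$workdir"
```

Only beta and delta are printed; gamma is excluded because 100 is not greater than 100. Splitting on a bare comma is a simplification: it does not handle quoted fields that themselves contain commas.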
A common pitfall is misunderstanding how awk treats fields. awk decides per operation: regex matching treats a field as a string, arithmetic treats it as a number, and a comparison is numeric only when both operands are numbers or input fields that look like numbers; otherwise it is a string comparison. This automatic type coercion is usually a convenience, but it can lead to unexpected behavior if you’re not careful. For instance, a field containing 007 is treated as the number 7 in a numeric comparison, which might not be what you intended if you were expecting string-based sorting or comparison. Appending an empty string ($1 "") forces a string comparison, and adding zero ($1 + 0) forces a numeric one.
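The 007 case can be checked directly. This is a minimal sketch that feeds a single field to awk and compares it both ways; the variable names are only for illustration.

```shell
# The input field 007 looks numeric, so comparing it with the number 7
# is done numerically.
numeric=$(echo '007' | awk '{ print (($1 == 7) ? "equal" : "different") }')

# Concatenating an empty string forces a string comparison: "007" vs "7".
string=$(echo '007' | awk '{ print ((($1 "") == "7") ? "equal" : "different") }')

echo "numeric comparison: $numeric"
echo "string comparison:  $string"
```

The first comparison reports equal and the second reports different, which is the distinction to keep in mind for IDs, ZIP codes, and other values with meaningful leading zeros.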
The next step in mastering these tools is understanding how to combine them for more complex workflows, like piping the output of grep into sed or awk for further processing.
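As a first taste of such a pipeline, the sketch below chains all three tools. The log lines and their timestamp format are assumptions made for the example.

```shell
# Hypothetical sample log, created in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/app.log" <<'EOF'
10:00:00 INFO service started
10:00:01 ERROR disk full
10:00:02 WARNING slow response
10:00:03 ERROR disk full
EOF

# grep selects the lines, sed strips the leading timestamp,
# awk counts how often each remaining message occurs,
# and sort orders the counts from highest to lowest.
summary=$(grep -E 'ERROR|WARNING' "$workdir/app.log" \
  | sed 's/^[0-9:]* //' \
  | awk '{ count[$0]++ } END { for (msg in count) print count[msg], msg }' \
  | sort -rn)
printf '%s\n' "$summary"

rm -rf "$workdir"
```

Each stage does one job and hands plain text to the next. The final sort is there because awk does not guarantee any particular order when iterating over an array with for (msg in count).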