gawk
When using gawk, the value of RS is not limited to a
one-character string. If it contains more than one character, it is
treated as a regular expression
(see section Regular Expressions). (c.e.)
In general, each record
ends at the next string that matches the regular expression; the next
record starts at the end of the matching string. This general rule is
actually at work in the usual case, where RS contains just a
newline: a record ends at the beginning of the next matching string (the
next newline in the input), and the following record starts just after
the end of this string (at the first character of the following line).
The newline, because it matches RS, is not part of either record.
When RS is a single character, RT contains the same single character.
However, when RS is a regular expression, RT contains
the actual input text that matched the regular expression.
If the input file ends without any text matching RS,
gawk sets RT to the null string.
The following example illustrates both of these features.
It sets RS equal to a regular expression that
matches either a newline or a series of one or more uppercase letters
with optional leading and/or trailing whitespace:
$ echo record 1 AAAA record 2 BBBB record 3 |
> gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
>             { print "Record =", $0,"and RT = [" RT "]" }'
-| Record = record 1 and RT = [ AAAA ]
-| Record = record 2 and RT = [ BBBB ]
-| Record = record 3 and RT = [
-| ]
The square brackets delineate the contents of RT, letting you
see the leading and trailing whitespace. The final value of
RT is a newline.
See section A Simple Stream Editor for a more useful example
of RS as a regexp and RT.
If you set RS to a regular expression that allows optional
trailing text, such as ‘RS = "abc(XYZ)?"’, it is possible, due
to implementation constraints, that gawk may match the leading
part of the regular expression, but not the trailing part, particularly
if the input text that could match the trailing part is fairly long.
gawk attempts to avoid this problem, but currently, there’s
no guarantee that this will never happen.
NOTE: Remember that in awk, the ‘^’ and ‘$’ anchor
metacharacters match the beginning and end of a string, and not
the beginning and end of a line. As a result, something like
‘RS = "^[[:upper:]]"’ can only match at the beginning of a file.
This is because gawk views the input file as one long string
that happens to contain newline characters.
It is thus best to avoid anchor metacharacters in the value of RS.
The use of RS as a regular expression and the RT
variable are gawk extensions; they are not available in
compatibility mode
(see section Command-Line Options).
In compatibility mode, only the first character of the value of
RS determines the end of the record.
mawk has allowed RS to be a regexp for decades.
As of October, 2019, BWK awk also supports it. Neither
version supplies RT, however.
RS = "\0" Is Not Portable
There are times when you might want to treat an entire data file as a
single record. The only way to make this happen is to give
You might think that for text files, the NUL character, which
consists of a character with all bits equal to zero, is a good
value to use for RS:
BEGIN { RS = "\0" } # whole file becomes one record?
Almost all other
It happens that recent versions of
See section Reading a Whole File at Once for an interesting way to read
whole files. If you are using