The uniq program (see section Printing Nonduplicated Lines of Text) removes duplicate lines from sorted data.
Suppose, however, you need to remove duplicate lines from a data file while preserving the order in which the lines appear. A good example of this might be a shell history file. The history file keeps a copy of all the commands you have entered, and it is not unusual to repeat a command several times in a row. Occasionally you might want to compact the history by removing duplicate entries. Yet it is desirable to maintain the order of the original commands.
This simple program does the job. It uses two arrays. The data array is indexed by the text of each line. For each line, data[$0] is incremented. If a particular line has not been seen before, then data[$0] is zero. In this case, the text of the line is stored in lines[count]. Each element of lines is a unique command, and the indices of lines indicate the order in which those lines are encountered. The END rule simply prints out the lines, in order:
# histsort.awk --- compact a shell history file
# Thanks to Byron Rakitzis for the general idea

{ if (data[$0]++ == 0) lines[++count] = $0 }

END { for (i = 1; i <= count; i++) print lines[i] }
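To run the program, save it in a file and invoke awk with the -f option. For example, assuming the script is saved as histsort.awk and the history lives in a file named ~/.bash_history (both names are only illustrative; your shell may use a different file), something like the following compacts the history. Writing to a temporary file first avoids truncating the history file before awk has finished reading it:

awk -f histsort.awk ~/.bash_history > /tmp/hist.tmp &&
mv /tmp/hist.tmp ~/.bash_history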
This program also provides a foundation for generating other useful information. For example, using the following print statement in the END rule indicates how often a particular command is used:
print data[lines[i]], lines[i]
This works because data[$0] is incremented each time a line is seen.
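With that substitution, the complete END rule would look like this (a sketch of the modified rule, not part of the original program):

END {
    for (i = 1; i <= count; i++)
        print data[lines[i]], lines[i]
}

Because each output line begins with the count, piping the result through 'sort -nr' would list the most frequently used commands first.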
Rick van Rein offers the following one-liner to do the same job of removing duplicates from unsorted text:
awk '{ if (! seen[$0]++) print }'
This can be simplified even further, at the risk of becoming almost too obscure:
awk '! seen[$0]++'
This version uses the expression as a pattern, relying on awk’s default action of printing the line when the pattern is true.
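Like the longer program, the one-liner preserves the original order of its input, so it can be applied directly to an unsorted file or used as a filter in a pipeline, writing the deduplicated lines to standard output. For example (the file name is purely illustrative):

awk '! seen[$0]++' ~/.bash_history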