Up: Splitting By Content [Contents][Index]
Manuel Collado notes that in addition to commas, a CSV field can also
contains quotes, that have to be escaped by doubling them. The previously
described regexps fail to accept quoted fields with both commas and
quotes inside. He suggests that the simplest FPAT
expression that
recognizes this kind of fields is /([^,]*)|("([^"]|"")+")/
. He
provides the following input data to test these variants:
p,"q,r",s p,"q""r",s p,"q,""r",s p,"",s p,,s
And here is his test program:
BEGIN { fp[0] = "([^,]+)|(\"[^\"]+\")" fp[1] = "([^,]*)|(\"[^\"]+\")" fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")" FPAT = fp[fpat+0] }
{ print "<" $0 ">" printf("NF = %s ", NF) for (i = 1; i <= NF; i++) { printf("<%s>", $i) } print "" }
When run on the third variant, it produces:
$ gawk -v fpat=2 -f test-csv.awk sample.csv -| <p,"q,r",s> -| NF = 3 <p><"q,r"><s> -| <p,"q""r",s> -| NF = 3 <p><"q""r"><s> -| <p,"q,""r",s> -| NF = 3 <p><"q,""r"><s> -| <p,"",s> -| NF = 3 <p><""><s> -| <p,,s> -| NF = 3 <p><><s>