Up: Splitting By Content   [Contents][Index]


4.7.1 More on CSV Files

Manuel Collado notes that in addition to commas, a CSV field can also contains quotes, that have to be escaped by doubling them. The previously described regexps fail to accept quoted fields with both commas and quotes inside. He suggests that the simplest FPAT expression that recognizes this kind of fields is /([^,]*)|("([^"]|"")+")/. He provides the following input data to test these variants:

p,"q,r",s
p,"q""r",s
p,"q,""r",s
p,"",s
p,,s

And here is his test program:

BEGIN {
     fp[0] = "([^,]+)|(\"[^\"]+\")"
     fp[1] = "([^,]*)|(\"[^\"]+\")"
     fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
     FPAT = fp[fpat+0]
}

{
     print "<" $0 ">"
     printf("NF = %s ", NF)
     for (i = 1; i <= NF; i++) {
         printf("<%s>", $i)
     }
     print ""
}

When run on the third variant, it produces:

$ gawk -v fpat=2 -f test-csv.awk sample.csv
-| <p,"q,r",s>
-| NF = 3 <p><"q,r"><s>
-| <p,"q""r",s>
-| NF = 3 <p><"q""r"><s>
-| <p,"q,""r",s>
-| NF = 3 <p><"q,""r"><s>
-| <p,"",s>
-| NF = 3 <p><""><s>
-| <p,,s>
-| NF = 3 <p><><s>