.\" -*- nroff -*-
.\"
.\" csvprintf - Simple CSV file parser for the UNIX command line
.\"
.\" Copyright 2010 Archie L. Cobbs <[email protected]>
.\"
.\" Licensed under the Apache License, Version 2.0 (the "License"); you may
.\" not use this file except in compliance with the License. You may obtain
.\" a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
.\"
.\" Unless required by applicable law or agreed to in writing, software
.\" distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
.\" WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
.\" License for the specific language governing permissions and limitations
.\" under the License.
.\"
.Dd November 30, 2010
.Dt CSVPRINTF 1
.Os
.Sh NAME
.Nm csvprintf
.Nd CSV file parser
.Sh SYNOPSIS
.Nm csvprintf
.Bk -words
.Op Ar options
.Ar format
.Ek
.Pp
.Nm csvprintf
.Bk -words
.Fl b
.Op Ar options
.Ek
.Pp
.Nm csvprintf
.Bk -words
.Fl j
.Op Ar options
.Ek
.Pp
.Nm csvprintf
.Bk -words
.Fl x
.Op Ar options
.Ek
.Pp
.Nm csvprintf
.Bk -words
.Fl X
.Op Ar options
.Ek
.Pp
.Nm xml2csv
.Bk -words
.Op Ar file.xml
.Ek
.Sh DESCRIPTION
.Nm
is a simple UNIX command line utility for parsing CSV files.
.Pp
In the first form,
.Nm
works like the
.Xr printf 1
command line utility: you supply a
.Xr printf 1
format string on the command line, and each row of the CSV file is split into arguments and formatted accordingly.
.Pp
The format specifiers in the format string contain numeric or symbolic column accessors to specify which CSV column to format.
.Pp
A numeric column accessor is a sequence of decimal digits followed by the
.Pa $
character (the same accessor format supported by
.Xr printf 1 ) .
So for example,
.Pa \(dq%3$d\(dq
would format the third CSV column as a decimal value.
In addition, the
.Pa \(dq%0$d\(dq
specifier will print the number of columns in the record.
.Pp
When the
.Fl n
flag is given, the first row is assumed to contain column names and is not output.
This allows symbolic, instead of numeric, column accessors to be used.
A symbolic column accessor is the column name enclosed in curly braces.
.Pp
For example, if the first row is
.Pa FirstName,LastName,IdNum
then the format string
.Pa \(dq%{IdNum}04d: %{LastName}s, %{FirstName}s\(dq
would be equivalent to the format string
.Pa \(dq%3$04d: %2$s, %1$s\(dq .
.Pp
Specifying a column name that does not appear in the first row generates an error,
so the use of symbolic column accessors adds an extra consistency check.
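.Pp
The equivalence between symbolic and numeric accessors can be sketched in plain shell.
This is only an illustration of the mapping, not a description of how
.Nm
is implemented; the variable names are arbitrary, and the header row and format string are modeled on the example above.
.Bd -literal -offset indent
```shell
# Hypothetical header row and format string, modeled on the example above.
header='FirstName,LastName,IdNum'
format='%{IdNum}04d: %{LastName}s, %{FirstName}s'

i=1
old_ifs=$IFS
IFS=,
for name in $header; do
    # Rewrite %{name} as %N$, where N is the 1-based column position.
    format=$(echo "$format" | sed "s/%{${name}}/%${i}$/g")
    i=$((i + 1))
done
IFS=$old_ifs

echo "$format"
```
.Ed
.Pp
The sketch prints
.Pa \(dq%3$04d: %2$s, %1$s\(dq ,
the numeric form of the same format string.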
.Sh XML Mode
With
.Fl x ,
the entire file is converted into an XML document.
.Pp
The document element is
.Ar "<csv>" .
.Pp
Each CSV row becomes a
.Ar "<row>"
element containing its individual column values as sub-elements.
.Pp
The column value sub-elements are named
.Ar "<col1>" ,
.Ar "<col2>" ,
etc.;
with
.Fl i ,
the sub-elements use the column names read from the first row (with illegal characters replaced by underscores).
.Pp
In XML mode, a character encoding must be assumed; see
.Fl e .
.Pp
The
.Nm xml2csv
command can convert XML documents generated by
.Nm "csvprintf -x"
back into CSV.
.Sh JSON Mode
With
.Fl j ,
each row is converted into a JSON document.
.Pp
The output format is described by RFC 7464 and consists of concatenated JSON documents,
each framed by ASCII RS and LF control characters, which is compatible with the
.Xr jq 1
utility's
.Fl \-seq
flag.
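.Pp
The framing can be illustrated without running
.Nm
at all.
The sketch below writes two hand-made records in the RS, JSON text, LF layout and then strips the RS bytes again; the record contents and variable names are assumptions chosen for illustration, not captured program output.
.Bd -literal -offset indent
```shell
# ASCII RS (record separator, decimal 30), built from its character code.
rs=$(awk 'BEGIN { printf "%c", 30 }')

# Two hand-written records framed as RFC 7464 describes: RS, JSON text, LF.
seq_file=$(mktemp)
{
    printf '%s%s' "$rs" '["Washington","George","Y"]'; echo
    printf '%s%s' "$rs" '["Lincoln","Abe","N"]'; echo
} > "$seq_file"

# Dropping the RS bytes leaves one JSON array per line.
json_lines=$(tr -d "$rs" < "$seq_file")
echo "$json_lines"
rm -f "$seq_file"
```
.Ed
.Pp
A consumer that understands the framing, such as
.Xr jq 1
with
.Fl \-seq ,
does not need the RS bytes removed first.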
.Pp
Normally each row is written as a string array;
with
.Fl i ,
each row is written as an object, using column names for fields.
An error occurs if two columns have the same name.
.Pp
In JSON mode, a character encoding must be assumed; see
.Fl e .
.Sh Bash Mode
With
.Fl b ,
each row is converted into
.Xr bash 1
variable assignment(s) which may be applied with the
.Xr eval 1
command.
.Pp
Normally the output just assigns
.Ar ROW
as an array of values.
The resulting output can be used like this:
.Bd -literal -offset indent
cat input.csv | csvprintf -b | while read -r LINE; do
    eval "${LINE}"
    echo "The first column is: ${ROW[0]}"
    echo "The second column is: ${ROW[1]}"
    ...
done
.Ed
.Pp
With
.Fl i ,
each column value is assigned to a separate variable whose name is the corresponding column name
(with underscores replacing non-alphanumeric characters), and an error occurs if two variables have the same name.
.Pp
So an input file like this:
.Bd -literal -offset indent
"Last Name","First Name","Registered???"
"Washington","George","Y"
"Lincoln","Abe","N"
.Ed
.Pp
can be processed like this:
.Bd -literal -offset indent
cat input.csv | csvprintf -bi -p ROW_ | while read -r LINE; do
    eval "${LINE}"
    echo "First name: ${ROW_First_Name}"
    echo "Last name: ${ROW_Last_Name}"
    echo "Registered: ${ROW_Registered___}"
done
.Ed
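.Pp
The naming rule used in that example can be approximated in a few lines of shell.
This is a rough sketch under stated assumptions: the helper name
.Ar to_var_name
is hypothetical, the prefix is the one passed to
.Fl p
above, and the sketch covers only the underscore substitution, not any other checks
.Nm
may perform.
.Bd -literal -offset indent
```shell
prefix='ROW_'   # the value passed to -p in the example above

# Print the prefix, then the column name with every
# non-alphanumeric character replaced by an underscore.
to_var_name() {
    printf '%s' "$prefix"
    echo "$1" | sed 's/[^A-Za-z0-9]/_/g'
}

to_var_name 'Last Name'
to_var_name 'First Name'
to_var_name 'Registered???'
```
.Ed
.Pp
The three calls print the variable names used in the loop above.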
.Sh Bash Mode Security Concerns
There are two security issues to be aware of when using Bash Mode.
.Pp
First, the
.Fl i
flag opens a security hole because Bash has several special variables like
.Ar PATH ,
.Ar TMPDIR ,
etc., which could be overwritten by malicious input.
To prevent this,
.Nm
omits known Bash variables, but for tighter security use the
.Fl c
flag to explicitly white-list the variables you need.
In addition, use of the
.Fl p
flag is always recommended in Bash Mode to help avoid namespace collisions.
.Pp
Second, if the Bash Mode output is piped into
.Ar "while read"
then the
.Fl r
flag of the shell's
.Ar read
builtin must be used to prevent unwanted decoding of backslash escapes.
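.Pp
The effect of omitting
.Fl r
can be seen with any line that contains a backslash.
The sketch below uses a made-up input line rather than real
.Nm
output, and its variable names are arbitrary.
.Bd -literal -offset indent
```shell
# Build a backslash from its ASCII code (92).
bs=$(awk 'BEGIN { printf "%c", 92 }')
# Hypothetical line containing a backslash followed by the letter t.
line="VALUE=a${bs}tb"

# With -r the line is stored byte for byte.
read -r kept <<EOF
$line
EOF

# Without -r the shell consumes the backslash.
read mangled <<EOF
$line
EOF

echo "with -r:    ${#kept} characters"
echo "without -r: ${#mangled} characters"
```
.Ed
.Pp
The first count is 10 and the second is 9, because the plain
.Ar read
silently dropped the backslash before the value could reach
.Ar eval .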
.Sh Input Encoding
In all modes, lines must be terminated by LF bytes or CR+LF byte pairs, and the separator and quote characters must be recognizable as single byte values.
This parsing behavior is compatible with ASCII, ISO-8859-1, UTF-8, etc., but not with encodings such as UTF-16 in which these characters occupy more than one byte; such input must be re-encoded (e.g., to UTF-8) first.
.Pp
In normal and Bash modes, column values are copied from input to output bytewise without interpretation.
.Pp
In XML and JSON modes, column values must be interpreted according to an assumed character encoding.
This encoding defaults to ISO-8859-1 but can be changed with the
.Fl e
flag.
.Sh OPTIONS
.Bl -tag -width Ds
.It Fl b
Convert each CSV row into a
.Xr bash 1
variable assignment line.
.It Fl c Ar colname
Specify a column to be included when using column names in XML, JSON, or Bash output.
.Pp
Without this flag, all columns are included.
When this flag is used one or more times,
only the specified columns are included.
.Pp
If any
.Ar colname
doesn't exist, an error occurs.
.It Fl e Ar encoding
Specify input character encoding for XML or JSON mode.
.Pp
By default, ISO-8859-1 is assumed.
.It Fl f Ar file
Read CSV input from the specified file.
.Pp
By default (or if ``-'' is specified),
.Nm
reads from standard input.
.It Fl i
Use column names read from the first record in the output.
.Pp
In normal mode, or when used with the
.Fl x
flag, this flag is equivalent to
.Fl n .
.Pp
In JSON mode, output objects instead of arrays and use column names for the object fields.
.Pp
In Bash mode, output a variable for each column instead of a single
.Ar ROW
array variable.
.Pp
It's possible for a row to have more columns than the column header row did.
In that case,
.Nm
reverts to using
.Ar col1 ,
.Ar col2 ,
etc., for any extra columns.
.Pp
This flag implies
.Fl n .
.It Fl j
Convert the input into a JavaScript Object Notation (JSON) text sequence document.
.It Fl n
Assume the first CSV record contains column names and omit it from the output.
.Pp
In normal mode, enable symbolic column accessors.
.It Fl p Ar prefix
Specify a common prefix (UTF-8 encoding) to use with all column names in the output.
.Pp
This flag is ignored unless
.Fl i
is specified.
.It Fl q Ar char
Specify an alternate CSV column quote character.
The usual backslash escape sequences are accepted.
.Pp
The default quote character is double quote.
.It Fl s Ar char
Specify an alternate CSV column separator character.
The usual backslash escape sequences are accepted.
.Pp
The default separator character is comma.
.It Fl h
Output usage message and exit.
.It Fl v
Output version information and exit.
.It Fl x
Convert the input into an XML document.
.It Fl X
Convert the input into an XML document using column names for value sub-elements.
.Pp
This flag implies
.Fl n .
.El
.Sh CSV FORMAT
.Nm
parses according to the format described by ``The Comma Separated Value (CSV) File Format'' (see below).
In particular, a quote character inside a quoted value must be escaped by doubling it, and whitespace surrounding column values is ignored.
.Sh EXIT STATUS
.Nm
exits with status 1 if invalid CSV input is detected.
Otherwise, if an invocation of
.Xr printf 1
fails, processing stops and that exit value is returned.
.Sh FILES
.Bl -tag -width Ds -compact
.It Pa @pkgdatadir@/csv.xsl
XSL transform that converts XML back into CSV format.
.El
.Sh BUGS
Under the hood,
.Nm
invokes the
.Xr printf 1
executable on each CSV row it parses, which makes it relatively slow.
.Sh SEE ALSO
.Xr printf 1 ,
.Xr printf 3 ,
.Xr jq 1 .
.Rs
.%T "csvprintf: Simple CSV file parser for the UNIX command line"
.%O https://github.com/archiecobbs/csvprintf
.Re
.Rs
.%T "The Comma Separated Value (CSV) File Format"
.%O http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
.Re
.Rs
.%T "RFC 7464: JavaScript Object Notation (JSON) Text Sequences"
.%O https://datatracker.ietf.org/doc/html/rfc7464
.Re
.Sh AUTHOR
.An Archie L. Cobbs Aq [email protected]