Subject: CVS commit: pkgsrc/textproc/miller
From: Thomas Klausner
Date: 2017-03-05 13:37:30
Message id: 20170305123730.406B7FBE4@cvs.NetBSD.org
Log Message:
Updated miller to 5.0.0.
Autodetected line-endings, in-place mode, user-defined functions, and more
This major release significantly expands the expressiveness of the DSL for mlr \
put and mlr filter. (The upcoming 5.1.0 release will add the ability to \
aggregate across all columns for non-DSL verbs such as mlr stats1 and mlr \
stats2. As well, a Windows port is underway.)
Please also see the Miller main docs.
Simple but impactful features:
Line endings (CRLF vs. LF, Windows-style vs. Unix-style) are now \
autodetected. For example, files (including CSV) with LF input will lead to LF \
output unless you specify otherwise.
There is now an in-place mode using mlr -I.
Major DSL features:
You can now define your own functions and subroutines: e.g. func f(x, y) { \
return x**2 + y**2 }.
New local variables are completely analogous to out-of-stream variables: sum \
retains its value for the duration of the expression it's defined in; @sum \
retains its value across all records in the record stream.
Local variables, function parameters, and function return types may be \
defined untyped or typed as in x = 1 or int x = 1, respectively. There are also \
expression-inline type-assertions available. Type-checking is up to you: omit it \
if you want flexibility with heterogeneous data; use it if you want to help \
catch misspellings in your DSL code or unexpected irregularities in your input \
data.
There are now four kinds of maps. Out-of-stream variables have always been \
scalars, maps, or multi-level maps: @a=1, @b[1]=2, @c[1][2]=3. The same is now \
true for local variables, which are new to 5.0.0. Stream records have always \
been single-level maps; $* is a map. And as of 5.0.0 there are now map literals, \
e.g. {"a":1, "b":2}, which can be defined using JSON-like \
syntax (with either string or integer keys) and which can be nested arbitrarily \
deeply.
You can loop over maps -- $*, out-of-stream variables, local variables, \
map-literals, and map-valued function return values -- using for (k, v in ...) \
or the new for (k in ...) (discussed next). All flavors of map may also be used \
in emit and dump statements.
User-defined functions and subroutines may take map-valued arguments, and \
may return map values.
Some built-in functions now accept map-valued input: typeof, length, depth, \
leafcount, haskey. There are built-in functions producing map-valued output: \
mapsum and mapdiff. There are now string-to-map and map-to-string functions: \
splitnv, splitkv, splitnvx, splitkvx, joink, joinv, and joinkv.
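The map functions named above can be modelled roughly in Python. This is only an illustrative sketch, not Miller code: the collision rule (rightmost map wins), the argument order (pair separator before field separator), and the string-only behaviour of the x variants are stated here as assumptions based on the function help in mlr -f, and the Python names merely mirror the Miller ones.

```python
# Python models of a few Miller 5.0.0 map functions, for illustration only.
# Python dicts (3.7+) keep insertion order, which is relied on below.

def mapsum(*maps):
    """Union of all arguments' key-value pairs; on a collision the rightmost wins (assumed)."""
    out = {}
    for m in maps:
        out.update(m)
    return out

def mapdiff(*maps):
    """Copy of the first map with every key found in any later map removed (assumed)."""
    if not maps:
        return {}
    out = dict(maps[0])
    for m in maps[1:]:
        for k in m:
            out.pop(k, None)
    return out

def splitnvx(s, sep):
    """Split a string into a map keyed 1, 2, 3, ...; values stay strings."""
    return {i: piece for i, piece in enumerate(s.split(sep), start=1)}

def splitkvx(s, pair_sep, field_sep):
    """Split "a=1,b=2"-style text into a map; values stay strings."""
    out = {}
    for field in s.split(field_sep):
        k, _, v = field.partition(pair_sep)
        out[k] = v
    return out

def joinkv(m, pair_sep, field_sep):
    """Inverse of splitkvx: join keys and values back into one string."""
    return field_sep.join(f"{k}{pair_sep}{v}" for k, v in m.items())

merged = mapsum({"a": 1, "b": 2}, {"b": 5, "c": 6})
print(merged)                                            # {'a': 1, 'b': 5, 'c': 6}
print(mapdiff(merged, {"a": 0}))                         # {'b': 5, 'c': 6}
print(splitnvx("x,y,z", ","))                            # {1: 'x', 2: 'y', 3: 'z'}
print(joinkv(splitkvx("a=1,b=2", "=", ","), "=", ","))   # a=1,b=2
```

The non-x variants (splitnv, splitkv) additionally infer numeric types for the values, which this sketch leaves out.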
Minor DSL features:
For iterating over maps (namely, local variables, out-of-stream variables, \
stream records, map literals, or return values from map-valued functions) there \
is now a key-only for-loop syntax: e.g. for (k in $*) { ... }. This is in \
addition to the already-existing for (k, v in ...) syntax.
There are now triple-statement for-loops (familiar from many other \
languages), e.g. for (int i = 0; i < 10; i += 1) { ... }.
mlr put and mlr filter now accept multiple -f for script files, freely \
intermixable with -e for expressions. The suggested use case is putting \
user-defined functions in script files and one-liners calling them using -e. \
Example: myfuncs.mlr defines the function f(...), then mlr put -f myfuncs.mlr -e \
'$o = f($i)' myfile.dat. More information is in the Miller documentation.
mlr filter is now almost identical to mlr put: it can have multiple \
statements, it can use begin and/or end blocks, it can define and invoke \
functions. Its final expression must evaluate to boolean which is used as the \
filter criterion. More details are in the Miller documentation.
The min and max functions are now variadic: $o = max($a, $b, $c).
There is now a substr function.
While ENV has long provided read-access to environment variables on the \
right-hand side of assignments (as a getenv), it now can be at the left-hand \
side of assignments (as a putenv). This is useful for subsidiary processes \
created by tee, emit, dump, or print when writing to a pipe.
The # in comments is now handled in the lexer, so you can now (correctly) \
include # in strings.
Separators are now available as read-only variables in the DSL: IPS, IFS, \
IRS, OPS, OFS, ORS. These are particularly useful with the split and join \
functions: e.g. with mlr --ifs tab ..., the IFS variable within a DSL expression \
will evaluate to a string containing a tab character.
Syntax errors in DSL expressions now have a little more context.
DSL parsing and execution are a bit more transparent. There have long been \
-v and -t options to mlr put and mlr filter, which print the expression's \
abstract syntax tree and do a low-level parser trace, respectively. There are \
now additionally -a which traces stack-variable allocation and -T which traces \
statements line by line as they execute. While -v, -t, and -a are most useful \
for development of Miller, the -T option gives you more visibility into what \
your Miller scripts are doing. See also the Miller documentation.
Verbs:
most-frequent and least-frequent are new, as requested in #110.
seqgen makes it easy to generate data from within Miller; the Miller \
documentation has a usage example.
unsparsify makes it easy to rectangularize data where not all records have \
the same fields.
cat -n now takes a group-by (-g) option, making it easy to number records \
within categories.
count-distinct, uniq, most-frequent, least-frequent, top, and histogram now \
take a -o option for specifying their output field names, as requested in #122.
Median is now a synonym for p50 in stats1.
You can now start a then chain with an initial then, which is nice in \
backslashy/multiline-continuation contexts. This was requested in #130.
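As a rough picture of what unsparsify and cat -n -g do to a record stream, here is a small Python model. It is a sketch under assumptions, not Miller's implementation: the empty-string filler, the first-appearance field order, and the counter field name n are assumed here, and the function names are hypothetical.

```python
def unsparsify(records, fill=""):
    """Give every record the union of all field names, filling gaps with `fill`."""
    all_keys = []
    seen = set()
    for rec in records:
        for k in rec:
            if k not in seen:          # remember field names in first-appearance order
                seen.add(k)
                all_keys.append(k)
    return [{k: rec.get(k, fill) for k in all_keys} for rec in records]

def cat_n_grouped(records, group_field):
    """Prepend a counter n that restarts at 1 within each value of group_field."""
    counts = {}
    out = []
    for rec in records:
        g = rec.get(group_field)
        counts[g] = counts.get(g, 0) + 1
        out.append({"n": counts[g], **rec})
    return out

records = [
    {"a": "pan", "x": 1},
    {"a": "eks", "y": 2},
    {"a": "pan", "x": 3, "y": 4},
]

for rec in unsparsify(records):
    print(rec)        # every record now has the fields a, x, y

for rec in cat_n_grouped(records, "a"):
    print(rec)        # n counts 1, 1, 2: numbering restarts per value of a
```

Note that unsparsify has to see the whole input before it can emit anything, since the full set of field names is not known until the last record has been read.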
I/O options:
The print statement may now be used with no arguments, which prints a \
newline, and a no-argument printn prints nothing but creates a zero-length file \
in redirected-output context.
Pretty-print format now has a --pprint --barred option (for output only, not \
input). The Miller documentation has an example.
There are now keystroke-savers of the form --c2p which abbreviate --icsvlite \
--opprint, and so on.
Miller's map literals are JSON-looking but allow integer keys which JSON \
doesn't. The --jknquoteint and --jvquoteall flags for mlr (when using JSON \
output) and mlr put (for dump) provide control over double-quoting behavior.
Documents new since the previous release:
Miller in 10 minutes is a long-overdue addition: while Miller's detailed \
documentation is evident, there has been a lack of more succinct examples.
The cookbook has likewise been expanded, and has been split out into three \
parts: part 1, part 2, part 3.
A new document gives a bit more background on C performance compared to other \
languages I experimented with early on in the development of Miller.
On-line help:
Help for DSL built-in functions, DSL keywords, and verbs is accessible using \
mlr -f, mlr -k, and mlr -l respectively; name-only lists are available with mlr \
-F, mlr -K, and mlr -L.
Bugfixes:
A corner-case bug causing a segmentation violation on two sub/gsub \
statements within a single put, the first one matching its pattern and the \
second one not matching its pattern, has been fixed.
Backward incompatibilities: This is Miller 5.0.0, not 4.6.0, due to the \
following (all relatively minor):
The v variables bound in for-loops such as for (k, v in \
some_multi_level_map) { ... } can now be map-valued if the v specifies a \
non-terminal in the map.
There are new keywords such as var, int, float, num, str, bool, map, IPS, \
IFS, IRS, OPS, OFS, ORS which can no longer be used as variable names. See mlr \
-k for the complete list.
Unset of the last key in a map-valued variable's map level no longer \
removes the level: e.g. with @v[1][2]=3 and unset @v[1][2], the @v variable \
would previously be empty. As of 5.0.0, @v has key 1 with an empty-map value.
There is no longer type-inference on literals: "3"+4 no longer \
gives 7. (That was never a good idea.)
The typeof function used to say things like MT_STRING; now it says things \
like string.
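Two of the incompatibilities above concern multi-level maps, and can be pictured with plain Python dicts. This is an analogy only: the variable names are hypothetical and the code is not Miller DSL.

```python
# A nested dict standing in for Miller's out-of-stream map @v.
v = {1: {2: 3}}     # like @v[1][2]=3
del v[1][2]         # like unset @v[1][2]
print(v)            # {1: {}}: key 1 survives with an empty-map value, as in 5.0.0

# Looping over a multi-level map with for (k, v in ...): the bound value is
# itself a map wherever the key is a non-terminal.
multi = {"pan": {"x": 1, "y": 2}, "count": 3}
kinds = {k: ("map" if isinstance(val, dict) else "scalar")
         for k, val in multi.items()}
print(kinds)        # {'pan': 'map', 'count': 'scalar'}
```

Scripts that relied on the enclosing level vanishing after an unset, or on loop-bound values always being scalars, are the ones most likely to need a look when upgrading.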
Files: