pkgsrc.se | The NetBSD package collection

Subject: CVS commit: pkgsrc/textproc/R-readr
From: Makoto Fujiwara
Date: 2021-09-11 03:07:42
Message id: 20210911010742.E8C37FA97@cvs.NetBSD.org
Log Message:
(textproc/R-readr) Updated 1.3.1 to 2.0.1

# readr 2.0.1

* `options(readr.show_col_types = FALSE)` now works as intended (#1250)

* `read_delim_chunked()` now again correctly respects the `chunk_size`
  parameter (#1248)

* `read_tsv()` now correctly passes the `quote` and `na` arguments to
  `vroom::vroom()` (#1254, #1255)

* Avoid spurious byte compilation errors due to the programatically
  generated `spec_*()` functions.

# readr 2.0.0

## second edition changes

readr 2.0.0 is a major release of readr and introduces a new second
edition parsing and writing engine implemented via the
[vroom](https://vroom.r-lib.org/) package.

This engine takes advantage of lazy reading, multi-threading and
performance characteristics of modern SSD drives to significantly
improve the performance of reading and writing compared to the first
edition engine.

We will continue to support the first edition for a number of
releases, but eventually this support will be first deprecated and
then removed.

You can use the `with_edition()` or `local_edition()` functions to
temporarily change the edition of readr for a section of code.

e.g.

- `with_edition(1, read_csv("my_file.csv"))` will read `my_file.csv`
  with the first edition of readr.

- `readr::local_edition(1)` placed at the top of your function or
  script will use the first edition for the rest of the function or
  script.

### Lazy reading

Edition two uses lazy reading by default.
When you first call a `read_*()` function the delimiters and newlines
throughout the entire file are found, but the data is not actually
read until it is used in your program.  This can provide substantial
speed improvements for reading character data.  It is particularly
useful during interactive exploration of only a subset of a full
dataset.

However this also means that problematic values are not necessarily
seen immediately, only when they are actually read.  Because of this a
warning will be issued the first time a problem is encountered, which
may happen after initial reading.

Run `problems()` on your dataset to read the entire dataset and return
all of the problems found.
Run `problems(lazy = TRUE)` if you only want to retrieve the problems
found so far.

Deleting files after reading is also impacted by laziness.

On Windows open files cannot be deleted as long as a process has the
file open.  Because readr keeps a file open when reading lazily this
means you cannot read, then immediately delete the file.  readr will
in most cases close the file once it has been completely read.
However, if you know you want to be able to delete the file after
reading it is best to pass `lazy = FALSE` when reading the file.

### Reading multiple files at once

Edition two has built-in support for reading sets of files with the
same columns into one output table in a single command.  Just pass the
filenames to be read in the same vector to the reading function.

First we generate some files to read by splitting the nycflights dataset by
airline.

```{r}
library(nycflights13)
purrr::iwalk(
  split(flights, flights$carrier),
  ~ { .x$carrier[[1]]; vroom::vroom_write(.x, \ 
glue::glue("flights_{.y}.tsv"), delim = "\t") }
)
```

Then we can efficiently read them into one tibble by passing the filenames
directly to readr.

```{r}
files <- fs::dir_ls(glob = "flights*tsv")
files
readr::read_tsv(files)
```

If the filenames contain data, such as the date when the sample was collected,
use `id` argument to include the paths as a column in the data.
You will likely have to post-process the paths to keep only the relevant portion \ 
for your use case.

### Delimiter guessing

Edition two supports automatic guessing of delimiters.
Because of this you can now use `read_delim()` without specifying a `delim` \ 
argument in many cases.

```{r}
x <- read_delim(readr_example("mtcars.csv"))
```

### Literal data

In edition one the reading functions treated any input with a newline
in it or vectors of length > 1 as literal data.  In edition two
vectors of length > 1 are now assumed to correspond to multiple files.
Because of this we now have a more explicit way to represent literal
data, by putting `I()` around the input.

```{r}
readr::read_csv(I("a,b\n1,2"))
```

### License changes

We are systematically re-licensing tidyverse and r-lib packages to use
the MIT license, to make our package licenses as clear and permissive
as possible.

To this end the readr and vroom packages are now released under the MIT license.

### Deprecated or superseded functions and features

* `melt_csv()`, `melt_delim()`, `melt_tsv()` and `melt_fwf()` have
  been superseded by functions in the same name in the meltr package.
  The versions in readr have been deprecated.  These functions rely on
  the first edition parsing code and would be challenging to update to
  the new parser.  When the first edition parsing code is eventually
  removed from readr they will be removed.

* `read_table2()` has been renamed to `read_table()`, as most users
  expect `read_table()` to work like `utils::read.table()`.  If you
  want the previous strict behavior of the `read_table()` you can use
  `read_fwf()` with `fwf_empty()` directly (#717).

* Normalizing newlines in files with just carriage returns `\r` is no
  longer supported.  The last major OS to use only CR as the newline
  was 'classic' Mac OS, which had its final release in 2001.

### Other second edition changes

* `read_*_chunked()` functions now include their specification as an
  attribute (#1143)

* All `read_*()` functions gain a `col_select` argument to more easily
  choose which columns to select.

* All `read_*()` functions gain a `id` argument to optionally store
  the file paths when reading multiple files.

* All `read_*()` functions gain a `name_repair` argument to control
  how column names are repaired.

* All `read_*()` and `write_*()` functions gain a `num_threads`
  argument to control the number of processing threads they use
  (#1201)

* All `write_*()` and `format_*()` functions gain `quote` and `escape`
  arguments, to explicitly control how fields are quoted and how
  double quotes are escaped. (#653, #759, #844, #993, #1018, #1083)

* All `write_*()` functions gain a `progress` argument and display a
  progress bar when writing (#791).

* write_excel_csv() now defaults to `quote = "all"` (#759)

* write_tsv() now defaults to `quote = "none"` (#993)

* `read_table()` now handles skipped lines with unpaired quotes properly (#1180)

## Additional features and fixes

* The BH package is no longer a dependency.  The boost C++ headers in
  BH have thousands of files, so can take a long time to extract and
  compiling them takes a great deal of memory, which made readr
  difficult to compile on systems with limited memory (#1147).

* readr now uses the tzdb package when parsing date-times
  (@DavisVaughan, r-lib/vroom#273)

* Chunked readers now support files with more than `INT_MAX` (~ 2
  Billion) number of lines (#1177)

* Memory no longer inadvertently leaks when reading memory from R
  connections (#1161)

* Invalid date formats no longer can potentially crash R (#1151)

* `col_factor()` now throws a more informative error message if given
  non-character levels (#1140)

* `problems()` now takes `.Last.value` as its default argument.  This
  lets you run `problems()` without an argument to see the problems in
  the previously read dataset.

* `read_delim()` fails when sample of parsing problems contains
  non-ASCII characters (@hidekoji, #1136)

* `read_log()` gains a `trim_ws` argument (#738)

* `read_rds()` and `write_rds()` gain a `refhook` argument, to pass
  functions that handle references objects (#1206)

* `read_rds()` can now read .Rds files from URLs (#1186)

* `read_*()` functions gain a `show_col_types` argument, if set to
  `FALSE` this turns off showing the column types unconditionally.

* `type_convert()` now throws a warning if the input has no character
  columns (#1020)

* `write_csv()` now errors if given a matrix column (#1171)

* `write_csv()` now again is able to write data with duplicated column
  names (#1169)

* `write_file()` now forces its argument before opening the output
  file (#1158)

# readr 1.4.0

## Breaking changes

* `write_*()` functions first argument is now `file` instead of
  `path`, for consistency with the `read_*()` functions.  `path` has
  been deprecated and will be removed in a future version of readr
  (#1110, @brianrice2)

* `write_*()` functions now output any NaN values in the same way as
  NA values, controlled by the `na=` argument. (#1082).

## New features

* It is now possible to generate a column specification from any
  tibble (or data.frame) with `as.col_spec()` and convert any column
  specification to a short representation with `as.character()`

      s <- as.col_spec(iris)
      s
      #> cols(
      #>   Sepal.Length = col_double(),
      #>   Sepal.Width = col_double(),
      #>   Petal.Length = col_double(),
      #>   Petal.Width = col_double(),
      #>   Species = col_factor(levels = c("setosa", \ 
"versicolor", "virginica"), ordered = FALSE, include_na = \ 
FALSE)
      #> )
      as.character(s)
      #> [1] "ddddf"

* The cli package is now used for all messages.

* The runtime performance for tables with an extreme number of columns
  is greatly improved (#825)

* Compressed files are now detected by magic numbers rather than by
  the file extension (#1125)

* A memory leak when reading files is now fixed (#1092)

* `write_*()` functions gain a `eol =` argument to control the end of
  line character used (#857).  This allows writing of CSV files with
  Windows newlines (CRLF) if desired.

* The Rcpp dependency has been removed in favor of cpp11.

* The build system has been greatly simplified so should work on more
  systems.

## Additional features and fixes

* The full problem field is now displayed in the problems tibble, as
  intended (#444).

* New `%h` placeholder for parsing unrestricted hours (<0 and >23) to
  support parsing durations (#549, @krlmlr).

* `as.character.col_spec()` now handles logical columns as well
  (#1127)

* `fwf_positions(end)` no longer has a default argument and must be
  specified (#996)

* `guess_parser()` gains a `na` argument and removes NA values before
  guessing (#1041).

* `parse_guess()` now passes the `na` argument to `guess_parser()`

* `read_*` functions now close properly all connections, including on
  errors like HTTP errors when reading from a url (@cderv, #1050).

* `read_delimited()` no longer mistakenly stats literal filenames (#1063)

* `read_lines()` now ignores quotations when skipping lines (#991).

* `read_lines(skip_empty_rows = TRUE)` no longer crashes if a file
  ends with an empty line (#968)

* `write_*()` functions now invisibly return the input data frame
  unchanged, rather than a version with factors and dates converted to
  strings. (@jesse-ross, #975).

* `write_csv2()` now formats decimal numbers more consistently with
  `utils::write.csv2()` (#1087)

* `write_csv2()` and `format_csv2()` no longer pad number columns with
  whitespaces (@keesdeschepper, #1046).

* `write_excel_csv()` no longer outputs a byte order mark when
  appending to a file (#1075).

* Uses of `tibble::data_frame` updated to `tibble::tibble`
  ([tidyverse/dplyr#4069](https://github.com/tidyverse/dplyr/issues/4069),
  @thays42, #1124, @brianrice2)

* `read_delimited()` now returns an empty `tibble::data_frame()`
  rather than signaling an error when given a connection with an empty
  file (@pralitp, #963).

* More helpful error when trying to write out data frames with list
  columns (@ellessenne, #938)

* `type_convert()` removes a 'spec' attribute, because the current
  columns likely have modified data types.  The 'spec' attribute is
  set by functions like `read_delim()` (@jimhester, @wibeasley,
  #1032).

* `write_rds()` now can specify the Rds version to use. The default
  value is 2 as it's compatible to R versions prior to 3.5.0
  (@shrektan, #1001).

* Fixes for issues related to variable initialization in C++ code
  (@michaelquinn32, ##1133).
Files:
Revision	Action	file
1.5	modify	pkgsrc/textproc/R-readr/Makefile
1.3	modify	pkgsrc/textproc/R-readr/distinfo
1.1	remove	pkgsrc/textproc/R-readr/patches/patch-src_Makevars