./biology/filter-fastq, Filter reads from a FASTQ file


Branch: CURRENT, Version: 0.0.0.20210527nb2, Package name: filter-fastq-0.0.0.20210527nb2, Maintainer: pkgsrc-users

Filter reads from a FASTQ file using a list of identifiers.

Each entry in the input FASTQ file (or files) is checked against all
entries in the identifier list. Matches are included by default, or
excluded if the --invert flag is supplied. Paired-end files are kept
consistent (in order).
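The filtering rule described above can be illustrated with a minimal sketch. This is not the package's actual code: the function names read_fastq and filter_pairs are hypothetical, identifiers are assumed to be the header text after "@" up to the first whitespace (with a trailing /1 or /2 dropped), and an exact set lookup is swapped in for whatever matching strategy the tool really uses.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (identifier, record_lines) for each 4-line FASTQ record."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        # Assumption: the identifier is the first whitespace-delimited
        # token of the header, without a trailing /1 or /2 mate suffix.
        fields = record[0][1:].split()
        ident = fields[0] if fields else ""
        if ident.endswith(("/1", "/2")):
            ident = ident[:-2]
        yield ident, record


def filter_pairs(read1, read2, identifiers, invert=False):
    """Yield (record1, record2) pairs selected by the identifier list.

    Matches are kept by default and dropped when invert is True.
    Both mates are kept or dropped together, so the two outputs
    stay in the same order.
    """
    wanted = set(identifiers)
    # zip stops at the shorter input; a real tool should also check
    # that both files contain the same number of records.
    for (id1, rec1), (id2, rec2) in zip(read_fastq(read1), read_fastq(read2)):
        if id1 != id2:
            raise ValueError("paired files are out of sync: %s != %s" % (id1, id2))
        if (id1 in wanted) != invert:
            yield rec1, rec2


if __name__ == "__main__":
    r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nTTTT\n+\nIIII\n")
    r2 = io.StringIO("@a/2\nTGCA\n+\nIIII\n@b/2\nAAAA\n+\nIIII\n")
    for rec1, rec2 in filter_pairs(r1, r2, ["b"]):
        print("".join(rec1), end="")
        print("".join(rec2), end="")
```

Deciding once per pair, rather than once per file, is what keeps the two output files consistent: a read can never survive in one file while its mate is dropped from the other.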

This is almost certainly not the most efficient way to implement this
filtering procedure. I tested a few different strategies and this one
seemed the fastest. Current timing with 16 processes is about 10
minutes per 1M paired reads with gzip'd input and output, depending on
the length of the identifier list to filter by.

usage: filter_fastq.py [-h] [-i INPUT] [-1 READ1] [-2 READ2] [-p NUM_THREADS]
                       [-o OUTPUT] [-f FILTER_FILE] [-v] [--gzip]


CVS history:


   2023-08-14 07:25:36 by Thomas Klausner | Files touched by this commit (1247)
Log message:
*: recursive bump for Python 3.11 as new default
   2022-06-30 13:19:02 by Nia Alarie | Files touched by this commit (524)
Log message:
*: Revbump packages that use Python at runtime without a PKGNAME prefix
   2021-10-26 12:03:45 by Nia Alarie | Files touched by this commit (73)
Log message:
biology: Replace RMD160 checksums with BLAKE2s checksums

All checksums have been double-checked against existing RMD160 and
SHA512 hashes
   2021-10-07 15:19:44 by Nia Alarie | Files touched by this commit (73)
Log message:
biology: Remove SHA1 hashes for distfiles
   2021-05-27 19:11:42 by Brook Milligan | Files touched by this commit (4)
Log message:
biology/filter-fastq: add filter-fastq version 0.0.0.20210527
