Cutadapt¶
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
Cleaning your data in this way is often required: Reads from small-RNA sequencing contain the 3’ sequencing adapter because the read is longer than the molecule that is sequenced. Amplicon reads start with a primer sequence. Poly-A tails are useful for pulling out RNA from your sample, but often you don’t want them to be in your reads.
Cutadapt helps with these trimming tasks by finding the adapter or primer sequences in an error-tolerant way. It can also modify and filter single-end and paired-end reads in various ways. Adapter sequences can contain IUPAC wildcard characters. Cutadapt can also demultiplex your reads.
Cutadapt is available under the terms of the MIT license.
Cutadapt development was started at TU Dortmund University in the group of Prof. Dr. Sven Rahmann. It is currently being developed within NBIS (National Bioinformatics Infrastructure Sweden).
If you use Cutadapt, please cite DOI:10.14806/ej.17.1.200 .
Links¶
- Documentation
- Source code
- Report an issue
- Project page on PyPI (Python package index)
- Follow @marcelm_ on Twitter
- Wrapper for the Galaxy platform
Installation¶
Because Cutadapt development happens on Linux, this is the best supported platform, but it should also run on macOS and Windows.
Installation with conda¶
Cutadapt is available as a Conda package from the Bioconda channel. Install miniconda if you don’t have Conda. Then follow the Bioconda installation instructions (in particular, make sure you have both bioconda and conda-forge in your channels list).
To then install Cutadapt into a new Conda environment, use this command:
conda create -n cutadaptenv cutadapt
Here, cutadaptenv is the name of the Conda environment. You can choose a different name.
Then activate the environment. This needs to be done every time you open a new shell before you can use Cutadapt:
conda activate cutadaptenv
Finally, check whether it worked:
cutadapt --version
This should show the Cutadapt version number.
Installation with pip¶
If Python is already installed on your system (it very likely is), you can install Cutadapt using pip on the command line:
python3 -m pip install --user --upgrade cutadapt
This will download the software from PyPI (the Python Package Index) and install the cutadapt binary into $HOME/.local/bin. If an old version of Cutadapt exists on your system, the --upgrade parameter is required in order to install a newer version.
On many systems, you can then run the program like this:
cutadapt --version
If this does not work or this prints an unexpected version number, then you need to use the full path to run the program:
~/.local/bin/cutadapt --version
Alternatively, you can avoid having to type the full path by adding the directory $HOME/.local/bin to your $PATH environment variable.
Installation on a Debian-based Linux distribution¶
Cutadapt is also included in Debian-based Linux distributions, such as Ubuntu. Simply use your favorite package manager to install Cutadapt. On the command line, this should work:
sudo apt install cutadapt
or possibly
sudo apt install python3-cutadapt
Please be aware that distribution packages are very likely to be outdated. If you encounter unexpected behavior or need newer features, please use one of the other installation methods to get an up-to-date version before reporting bugs.
Dependencies¶
Cutadapt installation requires this software to be installed:
- Python 3.7 or newer
- Possibly a C compiler. For Linux, Cutadapt packages are provided as so-called “wheels” (.whl files) which come pre-compiled.
Under Ubuntu, you may need to install the packages build-essential and python3-dev to get a C compiler.
If you get an error message:
error: command 'gcc' failed with exit status 1
Then check the entire error message. If it says something about a missing Python.h file, then the problem is that you are missing Python development packages (python3-dev in Ubuntu).
System-wide installation (root required)¶
Generally, using sudo can be dangerous and the above methods that don’t require it are preferred. That said, if you have root access, you can install Cutadapt system-wide by running:
sudo python3 -m pip install cutadapt
This installs cutadapt into /usr/local/bin.
If you want to upgrade from an older version, use this command instead:
sudo python3 -m pip install --upgrade cutadapt
If the above does not work for you, then you can try to install Cutadapt into a virtual environment. This leads to fewer conflicts with system-installed packages:
sudo python3 -m venv /usr/local/cutadapt
sudo /usr/local/cutadapt/bin/pip install cutadapt
cd /usr/local/bin/
sudo ln -s ../cutadapt/bin/cutadapt
Installation on Windows¶
For some releases of Cutadapt, a single-file executable (cutadapt.exe) is made available on the GitHub releases page. Try that first, and if it does not work for you, please report the issue.
To install Cutadapt manually, keep reading.
There is no Bioconda package for Windows because Bioconda does not produce Windows packages. To install Cutadapt, you can use pip, but because Cutadapt contains components that need to be compiled, you also need to install a compiler.
Download a recent version (at least 3.7) of Python for Windows from <https://www.python.org/> and install it.
Download and install “Build Tools for Visual Studio 2019” from <https://visualstudio.microsoft.com/downloads/#build-tools-for-visual-studio-2019>. (There are many similarly named downloads on that page, ensure you get the right one.)
During installation, when the dialog about which components to install pops up, ensure that “C++ Build tools” is ticked. The download is quite big and can take a long time.
Open the command line (cmd.exe) and run py -m pip install cutadapt.
Test whether it worked by running py -m cutadapt --version. You should see the version number of Cutadapt.
When running Cutadapt this way, you will need to remember to write py -m cutadapt instead of just cutadapt.
Uninstalling¶
Type pip3 uninstall cutadapt and confirm with y to remove the package. Under some circumstances, multiple versions may be installed at the same time. Repeat the above command until you get an error message in order to make sure that all versions are removed.
Installing the development version¶
We recommend that you install Cutadapt into a so-called virtual environment if you decide to use the development version. The virtual environment is a single directory that contains everything needed to run the software. Nothing else on your system is changed, so you can simply uninstall this particular version of Cutadapt by removing the directory with the virtual environment.
The following instructions work on Linux using Python 3. Make sure you have installed the dependencies (python3-dev and build-essential on Ubuntu)!
First, choose where you want to place the directory with the virtual environment and what you want to call it. Let us assume you chose the path ~/cutadapt-venv. Then use these commands for the installation:
python3 -m venv ~/cutadapt-venv
~/cutadapt-venv/bin/python3 -m pip install --upgrade pip
~/cutadapt-venv/bin/pip install git+https://github.com/marcelm/cutadapt.git#egg=cutadapt
To run Cutadapt and see the version number, type
~/cutadapt-venv/bin/cutadapt --version
The reported version number will be something like 2.2.dev5+gf564208. This means that you are now running the version of Cutadapt that will become 2.2, and that it contains 5 changes (commits) since the previous release (2.1 in this case).
User guide¶
Basic usage¶
To trim a 3’ adapter, the basic command-line for Cutadapt is:
cutadapt -a AACCGGTT -o output.fastq input.fastq
The sequence of the adapter is given with the -a option. You need to replace AACCGGTT with the correct adapter sequence. Reads are read from the input file input.fastq and are written to the output file output.fastq.
Compressed in- and output files are also supported:
cutadapt -a AACCGGTT -o output.fastq.gz input.fastq.gz
Cutadapt searches for the adapter in all reads and removes it when it finds it. Unless you use a filtering option, all reads that were present in the input file will also be present in the output file, some of them trimmed, some of them not. Even reads that were trimmed to a length of zero are output. All of this can be changed with command-line options, explained further down.
Trimming of paired-end data is also supported.
Input and output file formats¶
The supported input and output file formats are FASTA and FASTQ, with optional compression.
The input file format is recognized from the file name extension. If the extension was not recognized or when Cutadapt reads from standard input, the contents are inspected instead.
The output file format is also recognized from the file name extension. If the extension was not recognized or when Cutadapt writes to standard output, the same format as the input is used for the output.
See also file format conversion.
Compressed files¶
Cutadapt supports compressed input and output files. Whether an input file needs to be decompressed or an output file needs to be compressed is detected automatically by inspecting the file name: For example, if it ends in .gz, then gzip compression is assumed:
cutadapt -a AACCGGTT -o output.fastq.gz input.fastq.gz
All of Cutadapt’s options that expect a file name support this.
The supported compression formats are gzip (.gz), bzip2 (.bz2) and xz (.xz).
The default compression level for gzip output is 6. Use option -Z to change this to level 1. The resulting files need more space, but compression is faster, which makes level 1 a good choice for short-lived intermediate files.
If available, Cutadapt uses pigz to speed up writing and reading of gzipped files.
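The extension-based detection described in this section can be pictured with a small Python sketch. The helper name guess_compression is hypothetical and is not part of Cutadapt; it only mirrors the stated rule that the file name suffix selects the compression format.

```python
# Hypothetical helper, not Cutadapt's actual code: map a file name to the
# compression format implied by its extension.
COMPRESSION_BY_EXTENSION = {".gz": "gzip", ".bz2": "bzip2", ".xz": "xz"}


def guess_compression(filename):
    """Return the compression format implied by the extension, or None."""
    for extension, name in COMPRESSION_BY_EXTENSION.items():
        if filename.endswith(extension):
            return name
    return None


for name in ("input.fastq.gz", "reads.fasta.bz2", "output.fastq"):
    print(name, "->", guess_compression(name))
```

A name without one of the three suffixes, including the dash used for standard input or output, yields None here, which corresponds to uncompressed data.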
Standard input and output¶
If no output file is specified via the -o option, then the output is sent to the standard output stream. Example:
cutadapt -a AACCGGTT input.fastq > output.fastq
There is one difference in behavior if you use Cutadapt without -o: The report is sent to the standard error stream instead of standard output. You can redirect it to a file like this:
cutadapt -a AACCGGTT input.fastq > output.fastq 2> report.txt
Wherever Cutadapt expects a file name, you can also write a dash (-) in order to specify that standard input or output should be used. For example:
tail -n 4 input.fastq | cutadapt -a AACCGGTT - > output.fastq
The tail -n 4 prints out only the last four lines of input.fastq, which are then piped into Cutadapt. Thus, Cutadapt will work only on the last read in the input file.
In most cases, you should probably use - at most once for an input file and at most once for an output file, in order not to get mixed output.
For the same reason, you should not use - for non-interleaved paired-end data.
You cannot combine - and gzip compression since Cutadapt needs to know the file name of the output or input file. If you want to have a gzip-compressed output file, use -o with an explicit name.
One last “trick” is to use /dev/null as an output file name. This special file discards everything you send into it. If you only want to see the statistics output, for example, and do not care about the trimmed reads at all, you could use something like this:
cutadapt -a AACCGGTT -o /dev/null input.fastq
Multi-core support¶
Cutadapt supports parallel processing, that is, it can use multiple CPU cores.
Multi-core is not enabled by default. To enable it, use the option -j N (or the spelled-out version --cores=N), where N is the number of cores to use.
To automatically detect the number of available cores, use -j 0 (or --cores=0). The detection takes into account resource restrictions that may be in place. For example, if running Cutadapt as a batch job on a cluster system, the actual number of cores assigned to the job will be used. (This works if the cluster system uses the cpuset(1) mechanism to impose the resource limitation.)
Also make sure that you have pigz (parallel gzip) installed if you use multiple cores and write to a .gz output file. Otherwise, compression of the output will be done in a single thread and therefore be a bottleneck.
New in version 1.15.
New in version 1.18: --cores=0 for autodetection
New in version 2.5: Multicore works with --untrimmed/too-short/too-long-(paired)-output
New in version 2.7: Multicore works with --info-file, --rest-file, --wildcard-file
New in version 3.0: Multicore support for demultiplexing added.
Speed-up tricks¶
There are several tricks for limiting wall-clock time while using Cutadapt.
-Z (equivalent to --compression-level=1) can be used to limit the amount of CPU time which is spent on the compression of output files. Alternatively, choosing filenames not ending with .gz, .bz2 or .xz will make sure no CPU time is spent on compression at all. On systems with slow I/O, it can actually be faster to set a higher compression level than 1.
Increasing the number of cores with -j will increase the number of reads per minute at a near-linear rate.
It is also possible to use pipes in order to bypass the filesystem and pipe Cutadapt’s output into an aligner such as BWA. The mkfifo command allows you to create named pipes in bash.
A command built this way runs Cutadapt and BWA simultaneously, using Cutadapt’s output as BWA’s input, and captures Cutadapt’s report in cutadapt.report.
Read processing stages¶
Cutadapt can do a lot more in addition to removing adapters. There are various command-line options that make it possible to modify and filter reads and to redirect them to various output files. Each read is processed in the following order:
- Read modification options are applied. This includes adapter removal, quality trimming, read name modifications etc. The order in which they are applied is the order in which they are listed in the help shown by cutadapt --help under the “Additional read modifications” heading. Adapter trimming itself does not appear in that list and is done after quality trimming and before length trimming (--length/-l).
- Filtering options are applied, such as removal of too short or untrimmed reads. Some of the filters also allow redirecting a read to a separate output file. The filters are applied in the order in which they are listed in the help shown by cutadapt --help under the “Filtering of processed reads” heading.
- If the read has passed all the filters, it is written to the output file.
Adapter types¶
Cutadapt can detect multiple adapter types. 5’ adapters precede the sequence of interest and 3’ adapters follow it. Further distinctions are made according to where in the read the adapter sequence is allowed to occur.
Adapter type | Command-line option |
---|---|
Regular 3’ adapter | -a ADAPTER |
Regular 5’ adapter | -g ADAPTER |
Non-internal 3’ adapter | -a ADAPTERX |
Non-internal 5’ adapter | -g XADAPTER |
Anchored 3’ adapter | -a ADAPTER$ |
Anchored 5’ adapter | -g ^ADAPTER |
5’ or 3’ (both possible) | -b ADAPTER |
Linked adapter | -a ^ADAPTER1...ADAPTER2 or -g ADAPTER1...ADAPTER2 |
By default, all adapters are searched error-tolerantly. Adapter sequences may also contain any IUPAC wildcard character (degenerate bases), such as N.
In addition, it is possible to remove a fixed number of bases from the beginning or end of each read, to remove low-quality bases (quality trimming) from the 3’ and 5’ ends, and to search for adapters also in the reverse-complemented reads.
Overview of adapter types¶
3’ adapter types¶
A 3’ adapter is assumed to be ligated to the 3’ end of your sequence of interest. When such an adapter is found, the adapter sequence itself and the sequence following it (if there is any) are trimmed. This table shows in which ways the different 3’ adapter types are allowed to occur in a read in order to be recognized by the program.
Adapter location in read | Read layout | Found by regular 3’ -a ADAPTER | Found by non-internal 3’ -a ADAPTERX | Found by anchored 3’ -a ADAPTER$ |
---|---|---|---|---|
Full adapter sequence anywhere | acgtacgtADAPTERacgt | yes | no | no |
Partial adapter sequence at 3’ end | acgtacgtacgtADAP | yes | yes | no |
Full adapter sequence at 3’ end | acgtacgtacgtADAPTER | yes | yes | yes |
5’ adapter types¶
A 5’ adapter is assumed to be ligated to the 5’ end of your sequence of interest. When such an adapter is found, the adapter sequence itself and the sequence preceding it (if there is any) are trimmed. This table shows in which ways the different 5’ adapter types are allowed to occur in a read in order to be recognized by the program.
Adapter location in read | Read layout | Found by regular 5’ -g ADAPTER | Found by non-internal 5’ -g XADAPTER | Found by anchored 5’ -g ^ADAPTER |
---|---|---|---|---|
Full adapter sequence anywhere | acgtADAPTERacgtacgt | yes | no | no |
Partial adapter sequence at 5’ end | PTERacgtacgtacgt | yes | yes | no |
Full adapter sequence at 5’ end | ADAPTERacgtacgtacgt | yes | yes | yes |
Regular 3’ adapters¶
A 3’ adapter is a piece of DNA ligated to the 3’ end of the DNA fragment of interest. The sequencer starts the sequencing process at the 5’ end of the fragment. If the fragment is shorter than the read length, the sequencer will sequence into the adapter and the reads will thus contain some part of the adapter. Depending on how much longer the read is than the fragment of interest, the adapter occurs 1) not at all, 2) partially or fully at the end of the read (not followed by any other bases), or 3) in full somewhere within the read, followed by some other bases.
Use Cutadapt’s -a option to find and trim such an adapter, allowing both partial and full occurrences.
For example, assume your fragment of interest is mysequence and the adapter is ADAPTER. Depending on the read length, you will get reads that look like this:
mysequen
mysequenceADAP
mysequenceADAPTER
mysequenceADAPTERsomethingelse
When you use -a ADAPTER to remove this type of adapter, this will be the result:
mysequen
mysequence
mysequence
mysequence
As this example shows, Cutadapt allows regular 3’ adapters to occur in full anywhere within the read (preceded and/or succeeded by zero or more bases), and also partially degraded at the 3’ end. Cutadapt deals with 3’ adapters by removing the adapter itself and any sequence that may follow. As a consequence, a sequence that starts with an adapter, like this, will be trimmed to an empty read:
ADAPTERsomething
By default, empty reads are kept and will appear in the output. If you do not want this, use the --minimum-length/-m filtering option.
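The behavior of a regular 3’ adapter can be sketched in a few lines of Python. This is a simplified model that assumes exact matching only (no error tolerance) and a minimum overlap of 3; the function name trim_regular_3prime is made up for illustration and is not Cutadapt’s implementation.

```python
def trim_regular_3prime(read, adapter, min_overlap=3):
    """Toy exact-match model of a regular 3' adapter (-a ADAPTER)."""
    # A full occurrence anywhere in the read: cut at the leftmost one.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Otherwise, look for a prefix of the adapter at the 3' end of the read,
    # trying the longest candidate first.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read


for read in ("mysequen", "mysequenceADAP", "mysequenceADAPTER",
             "mysequenceADAPTERsomethingelse", "ADAPTERsomething"):
    print(repr(read), "->", repr(trim_regular_3prime(read, "ADAPTER")))
```

The last read starts with the adapter and is therefore trimmed to an empty string, matching the empty-read case described above.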
Regular 5’ adapters¶
Note
Unless your adapter may also occur in a degraded form, you probably want to use an anchored 5’ adapter.
A 5’ adapter is a piece of DNA ligated to the 5’ end of the DNA fragment of interest. For this type of adapter to be found, the adapter sequence needs to either appear in full somewhere within the read (internal match) or at the start (5’ end) of it, where in the latter case also partial occurrences are allowed. In all cases, the adapter itself and the sequence preceding it are removed.
Assume your fragment of interest is mysequence and the adapter is ADAPTER. The reads may look like this:
ADAPTERmysequence
DAPTERmysequence
TERmysequence
somethingADAPTERmysequence
All the above sequences are trimmed to mysequence when you use -g ADAPTER.
As with 3’ adapters, the resulting read may have a length of zero when the sequence ends with the adapter. For example, the read
somethingADAPTER
will be empty after trimming.
Anchored 5’ adapters¶
An anchored 5’ adapter is an adapter that is expected to occur in full length at the beginning of the read. Example:
ADAPTERsomething
This is usually how forward PCR primers are found in the read in amplicon sequencing, for instance. In Cutadapt’s terminology, this type of adapter is called “anchored” to distinguish it from “regular” 5’ adapters, which are 5’ adapters with a less strict placement requirement.
If the adapter sequence is ADAPTER, use -g ^ADAPTER to remove an anchored 5’ adapter. The ^ is meant to indicate the “anchoring” to the beginning of the read. With this, the example read ADAPTERsomething is trimmed to just something.
An anchored 5’ adapter must occur in full at the beginning of the read. If the read happens to be shorter than the adapter, partial occurrences such as ADAPT are not found.
The requirement for a full match at the beginning of the read is relaxed when Cutadapt searches error-tolerantly, as it does by default. In particular, insertions and deletions may allow reads such as these to be trimmed, assuming the maximum error rate is sufficiently high:
BADAPTERsomething
ADAPTE
The B in the beginning is seen as an insertion, and the missing R as a deletion. If you also want to prevent this from happening, use the option --no-indels, which disallows insertions and deletions entirely.
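To make the effect of --no-indels concrete, here is a toy Python model of an anchored 5’ adapter in which mismatches are the only permitted errors. It assumes that the number of allowed errors is the adapter length times the maximum error rate, rounded down; the function name is hypothetical and this is not Cutadapt’s alignment code.

```python
def trim_anchored_5prime_noindels(read, adapter, max_error_rate=0.1):
    """Toy model of -g ^ADAPTER with --no-indels (mismatches only)."""
    if len(read) < len(adapter):
        return read  # partial occurrences of an anchored adapter are not found
    # Compare the adapter position by position with the start of the read.
    mismatches = sum(1 for a, b in zip(adapter, read) if a != b)
    allowed = int(len(adapter) * max_error_rate)  # assumption: rounded down
    if mismatches <= allowed:
        return read[len(adapter):]
    return read


print(trim_anchored_5prime_noindels("ADAPTERsomething", "ADAPTER"))
print(trim_anchored_5prime_noindels("ADABTERsomething", "ADAPTER", 0.2))
print(trim_anchored_5prime_noindels("BADAPTERsomething", "ADAPTER", 0.2))
print(trim_anchored_5prime_noindels("ADAPTE", "ADAPTER", 0.2))
```

In this model, the first two reads are trimmed to something, while BADAPTERsomething and ADAPTE are left untouched because they could only be matched with an insertion or a deletion.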
Anchored 3’ adapters¶
It is also possible to anchor 3’ adapters to the end of the read. This is useful, for example, if you work with merged overlapping paired-end reads. Add the $ character to the end of an adapter sequence specified via -a in order to anchor the adapter to the end of the read, such as -a ADAPTER$. The adapter will only be found if it occurs in full at the end of the read (that is, it must be a suffix of the read).
The requirement for a full match exactly at the end of the read is relaxed when Cutadapt searches error-tolerantly, as it does by default. You can disable insertions and deletions with --no-indels.
Anchored 3’ adapters work as if you had reversed the sequence and used an appropriate anchored 5’ adapter.
As an example, assume you have these reads:
mysequenceADAP
mysequenceADAPTER
mysequenceADAPTERsomethingelse
Using -a ADAPTER$ will result in:
mysequenceADAP
mysequence
mysequenceADAPTERsomethingelse
That is, only the middle read is trimmed at all.
Non-internal 5’ and 3’ adapters¶
The non-internal 5’ and 3’ adapter types disallow internal occurrences of the adapter sequence. This is like a less strict version of anchoring: The adapter must always be at one of the ends of the read, but - unlike anchored adapters - partial occurrences are also ok.
Use -a ADAPTERX (replace ADAPTER with your actual adapter sequence, but use a literal X) to disallow internal matches for a 3’ adapter. Use -g XADAPTER to disallow them for a 5’ adapter. Mnemonic: The X is not allowed to “shift into” the read.
Here are some examples for trimming reads with -a ADAPTERX:
Input read | Processed read |
---|---|
mysequenceADAP | mysequence |
mysequenceADAPTER | mysequence |
mysequenceADAPTERsomethingelse | mysequenceADAPTERsomethingelse |
Here are some examples for trimming reads with -g XADAPTER:
Input read | Processed read |
---|---|
APTERmysequence | mysequence |
ADAPTERmysequence | mysequence |
somethingelseADAPTERmysequence | somethingelseADAPTERmysequence |
New in version 1.17.
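The two tables for non-internal adapters can be reproduced with a short Python sketch. It assumes exact matching and a minimum overlap of 3, and the function names are invented for this illustration; Cutadapt itself searches error-tolerantly.

```python
def trim_noninternal_3prime(read, adapter, min_overlap=3):
    """Toy exact-match model of -a ADAPTERX: match only at the 3' end."""
    for k in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        # The first k adapter bases must be the last k bases of the read.
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read


def trim_noninternal_5prime(read, adapter, min_overlap=3):
    """Toy exact-match model of -g XADAPTER: match only at the 5' end."""
    for k in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        # The last k adapter bases must be the first k bases of the read.
        if read.startswith(adapter[len(adapter) - k:]):
            return read[k:]
    return read


print(trim_noninternal_3prime("mysequenceADAP", "ADAPTER"))
print(trim_noninternal_3prime("mysequenceADAPTERsomethingelse", "ADAPTER"))
print(trim_noninternal_5prime("APTERmysequence", "ADAPTER"))
print(trim_noninternal_5prime("somethingelseADAPTERmysequence", "ADAPTER"))
```

Because only the ends of the read are inspected, an internal occurrence of the adapter is never trimmed in this model.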
Linked adapters (combined 5’ and 3’ adapter)¶
If your sequence of interest is surrounded by a 5’ and a 3’ adapter, and you want to remove both adapters, then you can use a linked adapter. A linked adapter combines a 5’ and a 3’ adapter. By default, the adapters are not anchored, but in many cases, you should anchor the 5’ adapter by prefixing it with ^. See the previous sections for what anchoring means.
Note
Cutadapt versions before 2.0 anchored the 5’ adapter within linked adapters automatically even if the initial ^ was not specified. If you have scripts written for Cutadapt versions earlier than 2.0, please add the ^ so that the behavior does not change!
Linked adapters are specified as two sequences separated by ... (three dots):
cutadapt -a ^ADAPTER1...ADAPTER2 -o out.fastq.gz in.fastq.gz
If you anchor an adapter, it will also become marked as being required. If a required adapter cannot be found, the read will not be trimmed at all even if the other adapter occurs. If an adapter is not required, it is optional.
Also, when you use the --discard-untrimmed option (or --trimmed-only) with a linked adapter, then a read is considered to be trimmed only if all required adapters were found.
In the previous example, ADAPTER1 was anchored and therefore required, but ADAPTER2 was optional. Anchoring also ADAPTER2 (and making it required as well) would look like this:
cutadapt -a ^ADAPTER1...ADAPTER2$ -o out.fastq.gz in.fastq.gz
As an example, assume the 5’ adapter is FIRST, the 3’ adapter is SECOND and you have these input reads:
FIRSTmysequenceSECONDextrabases
FIRSTmysequenceSEC
FIRSTmyseque
anotherreadSECOND
Trimming with
cutadapt -a ^FIRST...SECOND -o output.fastq input.fastq
will result in
mysequence
mysequence
myseque
anotherreadSECOND
The 3’ adapter in the last read is not trimmed because the anchored 5’ adapter is required, but missing in the read.
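The example above can be traced with a simplified Python model of -a ^FIRST...SECOND, in which the anchored 5’ adapter is required and the 3’ adapter is optional. The sketch assumes exact matching and a minimum overlap of 3; both function names are hypothetical.

```python
def trim_3prime(read, adapter, min_overlap=3):
    """Toy exact-match regular 3' adapter: full match anywhere or partial at the end."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read


def trim_linked(read, front, back):
    """Toy model of -a ^FRONT...BACK (FRONT required, BACK optional)."""
    if not read.startswith(front):
        # The required adapter is missing: the read is not trimmed at all.
        return read
    return trim_3prime(read[len(front):], back)


for read in ("FIRSTmysequenceSECONDextrabases", "FIRSTmysequenceSEC",
             "FIRSTmyseque", "anotherreadSECOND"):
    print(read, "->", trim_linked(read, "FIRST", "SECOND"))
```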
Linked adapters do not work when used in combination with --info-file and --mask-adapter.
To provide adapter-trimming parameters for linked adapters, they need to be set for each constituent adapter separately, as in -g "ADAPTER1;min_overlap=5...ADAPTER2;min_overlap=6".
New in version 1.10.
New in version 1.13: Ability to anchor the 3’ adapter.
New in version 2.0: The 5’ adapter is no longer anchored by default.
Changing which adapters are required¶
As described, when you specify a linked adapter with -a, the adapters that are anchored become required, and the non-anchored adapters become optional. To change this, you can instead use -g to specify a linked adapter. In that case, both adapters are required (even if they are not anchored). This type of linked adapter is especially suited for trimming CRISPR screening reads. For example:
cutadapt -g ADAPTER1...ADAPTER2 -o out.fastq.gz in.fastq.gz
Here, both ADAPTER1 and ADAPTER2 are not anchored, but they are required because -g was used.
The -g option does not cover all cases, so you can also mark each adapter explicitly as required or optional using the trimming parameters required and optional. This is the only way to make an anchored adapter optional.
For example, to request that an anchored 5’ adapter (here ADAPTER1) should not be required, you can specify it like this:
cutadapt -a "^ADAPTER1;optional...ADAPTER2" -o output.fastq.gz input.fastq.gz
New in version 1.13: Option -g added.
Changed in version 1.15: Option -g requires both adapters.
Linked adapter statistics¶
For linked adapters, the statistics report contains a line like this:
=== Adapter 1 ===
Sequence: AAAAAAAAA...TTTTTTTTTT; Type: linked; Length: 9+10; Trimmed: 3 times; Half matches: 2
The value for “Half matches” tells you how often only the 5’-side of the adapter was found, but not the 3’-side of it. This applies only to linked adapters with regular (non-anchored) 3’ adapters.
5’ or 3’ adapters¶
The last type of adapter is a combination of the 5’ and 3’ adapter. You can use it when your adapter is ligated to the 5’ end for some reads and to the 3’ end in other reads. This probably does not happen very often, and this adapter type was in fact originally implemented because the library preparation in an experiment did not work as it was supposed to.
For this type of adapter, the sequence is specified with -b ADAPTER (or use the longer spelling --anywhere ADAPTER). The adapter may appear in the beginning (even degraded), within the read, or at the end of the read (even partially). The decision which part of the read to remove is made as follows: If there is at least one base before the found adapter, then the adapter is considered to be a 3’ adapter and the adapter itself and everything following it is removed. Otherwise, the adapter is considered to be a 5’ adapter and it is removed from the read, but the sequence after it remains.
Here are some examples.
Read before trimming | Read after trimming | Detected adapter type |
---|---|---|
MYSEQUENCEADAPTERSOMETHING | MYSEQUENCE | 3’ adapter |
MYSEQUENCEADAPTER | MYSEQUENCE | 3’ adapter |
MYSEQUENCEADAP | MYSEQUENCE | 3’ adapter |
MADAPTER | M | 3’ adapter |
ADAPTERMYSEQUENCE | MYSEQUENCE | 5’ adapter |
PTERMYSEQUENCE | MYSEQUENCE | 5’ adapter |
TERMYSEQUENCE | MYSEQUENCE | 5’ adapter |
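The decision rule for -b can be written down as a small Python sketch. It assumes exact matching with a minimum overlap of 3 and uses a made-up function name; it reproduces the rows of the table but is not how Cutadapt computes its alignments.

```python
def trim_anywhere(read, adapter, min_overlap=3):
    """Toy exact-match model of -b ADAPTER.

    Returns the trimmed read and the adapter type that was assumed.
    """
    pos = read.find(adapter)
    if pos > 0:
        # At least one base precedes the adapter: treat it as a 3' adapter.
        return read[:pos], "3' adapter"
    if pos == 0:
        # Nothing precedes the adapter: treat it as a 5' adapter.
        return read[len(adapter):], "5' adapter"
    # No full occurrence: look for partial occurrences at either end.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.startswith(adapter[len(adapter) - k:]):
            return read[k:], "5' adapter"
        if read.endswith(adapter[:k]):
            kind = "3' adapter" if k < len(read) else "5' adapter"
            return read[:len(read) - k], kind
    return read, None


for read in ("MYSEQUENCEADAPTERSOMETHING", "MYSEQUENCEADAP", "MADAPTER",
             "ADAPTERMYSEQUENCE", "PTERMYSEQUENCE"):
    print(read, "->", trim_anywhere(read, "ADAPTER"))
```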
Multiple adapter occurrences within a single read¶
If a single read contains multiple copies of the same adapter, the basic rule is that the leftmost match is used for both 5’ and 3’ adapters. For example, when searching for a 3’ adapter in
cccccADAPTERgggggADAPTERttttt
the read will be trimmed to
ccccc
When the adapter is a 5’ adapter instead, the read will be trimmed to
gggggADAPTERttttt
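The leftmost-match rule corresponds to what Python’s str.find returns, so the two outcomes above can be checked with a few lines. This is only an exact-match illustration of the rule; the variable names are chosen for this example.

```python
read = "cccccADAPTERgggggADAPTERttttt"
adapter = "ADAPTER"

# str.find returns the index of the leftmost occurrence.
start = read.find(adapter)

# As a 3' adapter: drop the adapter and everything after it.
trimmed_as_3prime = read[:start]
# As a 5' adapter: drop the adapter and everything before it.
trimmed_as_5prime = read[start + len(adapter):]

print(trimmed_as_3prime)  # ccccc
print(trimmed_as_5prime)  # gggggADAPTERttttt
```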
Adapter-trimming parameters¶
The adapter-trimming algorithm has a few parameters specific to each adapter that control how the adapter sequence is found. The command-line options -e and -O set the maximum error rate and minimum overlap parameters (see details in the following sections) for all adapters listed via the -a/-b/-g etc. options. When trimming more than one adapter, it may be necessary to change parameters for each adapter individually. You can do so by adding a semicolon and parameter=value to the end of the adapter sequence, as in -a "ADAPTER;max_error_rate=0.2". Multiple parameters can also be set, as in -a "ADAPTER;max_error_rate=0.2;min_overlap=5".
If using linked adapters, they have separate settings as in -g "ADAPTER1;min_overlap=5...ADAPTER2;min_overlap=6".
Remember to add the quotation marks; otherwise the shell will interpret the semicolon as a separator between two commands.
The following parameters are supported:
Parameter | Global option | Adapter-specific parameter |
---|---|---|
Maximum error rate (default: 0.1) | -e 0.2 | ADAPTER;e=0.2 or ADAPTER;max_errors=0.2 or ADAPTER;max_error_rate=0.2 |
Minimum overlap (default: 3) | -O 5 | ADAPTER;o=5 or ADAPTER;min_overlap=5 |
Disallow indels | | ADAPTER;noindels |
Allow indels (this is the default) | | ADAPTER;indels |
Allow matches anywhere | | ADAPTER;anywhere |
Linked adapter required | | ADAPTER;required |
Linked adapter optional | | ADAPTER;optional |
The minimum overlap length cannot be set for anchored adapters as these always need to occur at full length.
Adapter-specific parameters override the global option.
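The semicolon syntax and the override rule can be illustrated with a small Python sketch. The parser below is hypothetical and handles only the plain key=value and flag forms shown in the table (it does not resolve aliases such as e or o, nor linked adapters); it is not Cutadapt’s own parser.

```python
def _convert(value):
    """Turn '5' into an int and '0.2' into a float."""
    try:
        return int(value)
    except ValueError:
        return float(value)


def parse_adapter_spec(spec):
    """Split 'SEQUENCE;key=value;flag' into (sequence, parameters)."""
    sequence, *fields = spec.split(";")
    parameters = {}
    for field in fields:
        key, separator, value = field.partition("=")
        # Flags such as 'noindels' carry no value; store them as True.
        parameters[key.strip()] = _convert(value) if separator else True
    return sequence, parameters


# Global settings, as they would be given with -e and -O (defaults shown).
global_options = {"max_error_rate": 0.1, "min_overlap": 3}

sequence, specific = parse_adapter_spec("ADAPTER;max_error_rate=0.2;noindels")
# Adapter-specific parameters take precedence over the global options.
effective = {**global_options, **specific}
print(sequence, effective)
```

In this sketch, max_error_rate ends up as 0.2 for this adapter while min_overlap keeps the global value of 3.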
Error tolerance¶
All searches for adapter sequences are error tolerant. Allowed errors are mismatches, insertions and deletions. For example, if you search for the adapter sequence ADAPTER and the error tolerance is set appropriately (as explained below), then also ADABTER will be found (with 1 mismatch), as well as ADAPTR (with 1 deletion), and also ADAPPTER (with 1 insertion). If insertions and deletions are disabled with --no-indels, then mismatches are the only type of errors.
The level of error tolerance is determined by a maximum error rate, which is 0.1 (=10%) by default. An adapter occurrence is only found if the actual error rate of the match does not exceed the maximum error rate. The actual error rate is computed as the number of errors in the match divided by the length of the matching part of the adapter.
For example, an adapter match of length 8 containing 1 error has an error rate of 1/8=0.125. At the default maximum error rate 0.1, it would not be found, but a match of length 10 containing 1 error has an error rate of 1/10=0.1 and would be found.
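The acceptance criterion in the two examples above amounts to a single comparison. The following Python sketch uses a hypothetical function name and simply restates the rule that the actual error rate must not exceed the maximum error rate.

```python
def is_match_accepted(errors, match_length, max_error_rate=0.1):
    """Accept a match if errors / match_length does not exceed the maximum."""
    return errors / match_length <= max_error_rate


print(is_match_accepted(1, 8))   # 1/8 = 0.125 exceeds 0.1
print(is_match_accepted(1, 10))  # 1/10 = 0.1 does not exceed 0.1
```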
Relating the number of errors to the length of the matching part of the adapter is important because Cutadapt allows for partial adapter occurrences (for the non-anchored adapter types). If only the absolute number of errors were used, shorter matches would be favored unfairly. For example, assume an adapter has 30 bases and we allow three errors over that length. If we allowed these three errors even for a partial occurrence of, for example, four bases, we can immediately see that this results in unexpected matches. Using the error rate as a criterion helps to keep sensitivity and specificity roughly the same over the possible lengths of the matches.
The -e option on the command line allows you to change the maximum error rate. If the value is between 0 and 1 (but not 1 exactly), then this sets the maximum error rate directly for all specified adapters. The default is -e 0.1. You can also use the adapter-specific parameter max_error_rate or max_errors or just e to override the default for a single adapter only. Examples: -a "ADAPTER;max_error_rate=0.15", -a "ADAPTER;e=0.15" (the quotation marks are necessary).
Alternatively, you can also specify a value of 1 or greater as the number of
allowed errors, which is then converted to a maximum error rate for each adapter
individually. For example, with an adapter of length 10, using -e 2
sets the maximum error rate to 0.2.
The value does not have to be an integer, and if you use an adapter type
that allows partial matches, you may want to add 0.5 to the desired number of
errors. This ensures that matches slightly shorter than the full length
are also allowed at the specified number of errors. In short, if you
want to allow two errors, use -e 2.5
.
This also works in the adapter-specific parameters.
Examples: -a "ADAPTER;e=1"
, -a "ADAPTER;max_errors=2.5"
. Note that
e
, max_error_rate
and max_errors
are all equivalent and the
decision whether a rate or an absolute number is meant is based on
whether the given value is less than 1 or not.
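To see why adding 0.5 helps, the conversion from an absolute number of errors to a rate can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the allowed number of errors is the match length times the rate, rounded down, and that the adapter contains no N wildcards.

```python
def max_error_rate(e: float, adapter_length: int) -> float:
    """Interpret an -e style value: below 1 it is a rate, otherwise an absolute error count."""
    if e < 1:
        return e
    return e / adapter_length


def allowed_errors(match_length: int, rate: float) -> int:
    """Number of errors allowed for a match of the given length (rounded down)."""
    return int(match_length * rate)


# Adapter of length 10: compare -e 2 with -e 2.5 for shorter and shorter matches.
for e in (2, 2.5):
    rate = max_error_rate(e, 10)
    print(e, rate, [allowed_errors(n, rate) for n in (10, 9, 8, 7)])
```

Under these assumptions, `-e 2` allows two errors only for a full-length match, while `-e 2.5` still allows two errors when the match is 8 or 9 bases long.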
The number of errors allowed for a given adapter match length is also shown under the “No. of allowed errors” heading in the report that Cutadapt prints:
Sequence: 'SOMEADAPTER'; Length: 11; Trimmed: 2 times.
No. of allowed errors:
0-9 bp: 0; 10-11 bp: 1
This tells us: For match lengths of 0-9 bases, zero errors are allowed and for matches of length 10-11 bases, one error is allowed.
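The ranges in such a report line can be reproduced with a few lines of Python. This is a sketch under the assumption that the allowed number of errors is the match length multiplied by the maximum error rate, rounded down; `allowed_error_ranges` is a name invented for this example.

```python
def allowed_error_ranges(adapter_length: int, max_error_rate: float = 0.1):
    """Group match lengths 0..adapter_length by the number of errors allowed for them."""
    ranges = []
    start = 0
    current = 0
    for length in range(1, adapter_length + 1):
        errors = int(length * max_error_rate)
        if errors != current:
            # The allowed error count changes here, so close the previous range.
            ranges.append((start, length - 1, current))
            start, current = length, errors
    ranges.append((start, adapter_length, current))
    return ranges


# An adapter of length 11, such as SOMEADAPTER, at the default rate of 0.1.
line = "; ".join(f"{lo}-{hi} bp: {n}" for lo, hi, n in allowed_error_ranges(11))
print(line)
```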
See also the section on details of the alignment algorithm.
N wildcard characters¶
Any N
wildcard characters in the adapter sequence are skipped when
computing the error rate. That is, they do not contribute to the length of
a match. For example, the adapter sequence ACGTACNNNNNNNNGTACGT
has a length
of 20, but only 12 non-N
-characters. At a maximum error rate of 0.1, only
one error is allowed if this sequence is found in full in a read because
12·0.1=1.2, which is 1 when rounded down.
This is done because N
bases cannot contribute to the number of errors.
In previous versions, N
wildcard characters did contribute to the match
length, but this artificially inflates the number of allowed errors. For example,
an adapter like N{18}CC
(18 N
wildcards followed by CC
) would
effectively match anywhere because the default error rate of 0.1 would allow for
two errors, but there are only two non-N
bases in the particular adapter.
However, even in previous versions, the location with the greatest number of matching bases is chosen as the best location for an adapter, so in many cases the adapter would still be placed properly.
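The effect of skipping N characters can be illustrated with a small calculation. The functions below are a sketch written for this section, not part of Cutadapt; the `count_n` switch exists only to mimic the behavior of previous versions described above.

```python
def effective_length(adapter: str) -> int:
    """Length of the adapter when N wildcard characters are not counted."""
    return sum(1 for base in adapter if base != "N")


def allowed_errors_full_match(adapter: str, max_error_rate: float = 0.1, count_n: bool = False) -> int:
    """Errors allowed for a full-length match; count_n=True imitates the old behavior."""
    length = len(adapter) if count_n else effective_length(adapter)
    return int(length * max_error_rate)


print(effective_length("ACGTACNNNNNNNNGTACGT"))
print(allowed_errors_full_match("ACGTACNNNNNNNNGTACGT"))
# N{18}CC written out: no errors are allowed now, two would have been allowed before.
print(allowed_errors_full_match("N" * 18 + "CC"))
print(allowed_errors_full_match("N" * 18 + "CC", count_n=True))
```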
Minimum overlap (reducing random matches)¶
Since Cutadapt allows partial matches between the read and the adapter sequence for most adapter types, short matches can occur by chance, leading to erroneously trimmed bases. For example, just by chance, we expect that roughly 25% of all reads end with a base that is identical to the first base of the adapter. To reduce the number of falsely trimmed bases, the alignment algorithm requires that at least three bases of the adapter are aligned to the read.
This minimum overlap length can be changed globally (for all adapters) with the parameter
--overlap
(or its short version -O
). The option is ignored for
anchored adapters since these do not allow partial matches.
Alternatively, use the adapter-specific
parameter min_overlap
to change it for a single adapter only. Example:
-a "ADAPTER;min_overlap=5"
(the quotation marks are necessary).
For anchored adapters, attempting to set a minimum overlap this way will
result in an error.
In linked adapters, the minimum overlap length is applied separately to the 5’ and the 3’ adapter.
If a read contains a partial adapter sequence shorter than the minimum overlap length, no match will be found (and therefore no bases are trimmed).
Requiring at least three bases to match is quite conservative. Even if no minimum overlap was required, we can compute that we lose only about 0.44 bases per read on average, see Section 2.3.3 in my thesis. With the default minimum overlap length of 3, only about 0.07 bases are lost per read.
When choosing an appropriate minimum overlap length, take into account that true adapter matches are also lost when the overlap length is higher than zero, reducing Cutadapt’s sensitivity.
It is possible that fewer bases are removed from a read than the minimum overlap length seems to imply. The overlap length is the number of bases in the adapter that got aligned to the read, which means that if there are deletions in the adapter, the corresponding part in the read will be shorter. (This is only relevant when the maximum allowed error rate and/or the minimum overlap length are changed such that at least one error is allowed over the given length.)
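The "roughly 25%" figure for a one-base overlap generalizes to longer overlaps if one assumes that the four bases occur independently and with equal frequency, and that only exact matches count. The sketch below shows how quickly the chance of a purely random match drops under that simplified model; it does not reproduce the per-read loss figures quoted above.

```python
def random_match_probability(overlap: int) -> float:
    """Chance that the last `overlap` bases of a read equal the adapter's first bases.

    Assumes uniformly distributed, independent bases and no allowed errors.
    """
    return 0.25 ** overlap


for overlap in range(1, 6):
    print(overlap, random_match_probability(overlap))
```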
Allowing partial matches at both ends¶
The regular 5’ and 3’ adapter types allow partial adapter occurrences only
at the 5’ and 3’ end of the read, respectively. To allow partial matches at both ends,
you can use the anywhere
adapter-specific parameter.
A 3’ adapter specified via -a ADAPTER
will be found even
when it occurs partially at the 3’ end, as in mysequenceADAPT
. However,
it will by default not be found if it occurs partially at the 5’ end, as in
APTERmysequence
. To find the adapter in both cases, specify
the adapter as -a "ADAPTER;anywhere"
.
Similarly, for a 5’ adapter specified via -g ADAPTER
, partial matches at
the 3’ end are not found, as in mysequenceADAPT
. To allow partial matches
at both ends, use -g "ADAPTER;anywhere"
.
Note
With anywhere
, partial matches at the end that is usually not allowed
to be matched will result in empty reads! This means that short random
matches have a much greater detrimental effect and you should
increase the minimum overlap length.
Searching reverse complements¶
By default, Cutadapt expects adapters to be given in the same orientation (5’ to 3’) as the reads. That is, neither reads nor adapters are reverse-complemented.
To change this, use option --revcomp
or its abbreviation --rc
. If given, Cutadapt searches
both the read and its reverse complement for adapters. If the reverse complemented read yields
a better match, then that version of the read is kept. That is, the output file will contain the
reverse-complemented sequence. This can be used to “normalize” read orientation/strandedness.
To determine which version of the read yields the better match, the full adapter search (possibly
multiple rounds if --times
is used) is done independently on both versions, and the version that
results in the higher number of matching nucleotides is considered to be the better one.
The name of a reverse-complemented read is changed by adding a space and rc
to it. (Please
file an issue if you would like this to be configurable.)
The report will show the number of reads that were reverse-complemented, like this:
Total reads processed: 60
Reads with adapters: 50 (83.3%)
Reverse-complemented: 20 (33.3%)
Here, 20 reverse-complemented reads contain an adapter and 50 - 20 = 30 reads that did not need to be reverse-complemented contain an adapter.
Option --revcomp
is currently available only for single-end data.
New in version 2.8.
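As an illustration of what reverse-complementing does to a read and its name, here is a minimal sketch. It deliberately simplifies the comparison of the two orientations to an exact substring test, whereas Cutadapt compares the number of matching nucleotides of error-tolerant matches; the function names are invented for this example.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def normalize_orientation(name: str, seq: str, adapter: str):
    """Keep the orientation in which the adapter occurs (exact matching only)."""
    rc = reverse_complement(seq)
    if adapter not in seq and adapter in rc:
        # Mimic the renaming described above: a space and "rc" are appended.
        return name + " rc", rc
    return name, seq


print(reverse_complement("ACGTT"))
print(normalize_orientation("read1", "AACGT", "CGTT"))
```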
Specifying adapter sequences¶
Wildcards¶
All IUPAC nucleotide codes
(wildcard characters, degenerate bases) are supported.
For example, use an N
in the adapter
sequence to match any nucleotide in the read, or use -a YACGT
for an adapter
that matches both CACGT
and TACGT
. The wildcard character N
is
useful for trimming adapters with an embedded variable barcode:
cutadapt -a ACGTAANNNNTTAGC -o output.fastq input.fastq
Even the X
wildcard that does not match any nucleotide is supported. If
used as in -a ADAPTERX
or -g XADAPTER
, it acquires a special meaning for
the matching algorithm
and disallows internal adapter matches.
Wildcard characters are by default only allowed in adapter sequences and
are not recognized when they occur in a read. This is to avoid matches in reads
that consist of many (often low-quality) N
bases. Use
--match-read-wildcards
to enable wildcards also in reads.
Use the option -N
to disable interpretation of wildcard characters even in
the adapters. If wildcards are disabled entirely, that is, when you use -N
and do not use --match-read-wildcards
, then Cutadapt compares characters
by their ASCII value. Thus, both the read and adapter can be arbitrary strings
(such as SEQUENCE
or ADAPTER
as used here in the examples).
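The way IUPAC codes in an adapter match plain nucleotides in a read can be sketched with a lookup table. This example only compares two strings of equal length position by position; it does not perform alignment and is not how Cutadapt implements matching. As in the default behavior described above, an N in the read matches nothing.

```python
# Standard IUPAC nucleotide codes; X is included as a code that matches no base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "",
}


def adapter_matches(adapter: str, read_segment: str) -> bool:
    """Check whether every read base is allowed by the adapter code at that position."""
    if len(adapter) != len(read_segment):
        return False
    return all(base in IUPAC[code] for code, base in zip(adapter, read_segment))


print(adapter_matches("YACGT", "CACGT"))
print(adapter_matches("YACGT", "AACGT"))
print(adapter_matches("ACGTAANNNNTTAGC", "ACGTAAGATCTTAGC"))
```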
Repeated bases¶
If you have many repeated bases in the adapter sequence, such as many N
s or
many A
s, you do not have to spell them out. For example, instead of writing
ten A
in a row (AAAAAAAAAA
), write A{10}
instead. The number within
the curly braces specifies how often the character that precedes it will be
repeated. This works also for IUPAC wildcard characters, as in N{5}
.
It is recommended that you use quotation marks around your adapter sequence if you use this feature. For poly-A trimming, for example, you would write:
cutadapt -a "A{100}" -o output.fastq input.fastq
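The curly-brace notation can be expanded with a single regular expression substitution. This is a stand-alone sketch for illustration (the function name is hypothetical); Cutadapt does the expansion itself when it parses the adapter.

```python
import re


def expand_repeats(adapter: str) -> str:
    """Expand X{n} into n copies of the character X."""
    return re.sub(r"(.)\{(\d+)\}", lambda m: m.group(1) * int(m.group(2)), adapter)


print(expand_repeats("A{10}"))
print(expand_repeats("N{5}ACGT"))
```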
Modifying reads¶
This section describes in which ways reads can be modified other than adapter removal.
--action
changes what is done when an adapter is found¶
The --action
option can be used to change what is done when an adapter match
is found in a read.
The default is --action=trim
, which will remove the adapter and the
sequence before or after it from the read. For 5’ adapters, the adapter and
the sequence preceding it is removed. For 3’ adapters, the adapter and the
sequence following it is removed. Since linked adapters are a combination of
a 5’ and 3’ adapter, in effect only the sequence between the 5’ and the 3’
adapter matches is kept.
With --action=retain
, the read is trimmed, but the adapter sequence itself
is not removed. Up- and downstream sequences are removed in the same way as
for the trim
action. For linked adapters, both adapter sequences are kept.
Note
Because it is somewhat unclear what should happen, --action=retain
can
at the moment not be combined with --times
(multiple rounds of adapter
removal).
Use --action=none
to not change the read even if there is a match.
This is useful because the statistics will still be updated as before
and because the read will still be considered “trimmed” for the read
filtering options. Combining this with --untrimmed-output
, for
example, can be used to copy reads without adapters to a different
file. Other read modification options, if used, may still change
the read.
Use --action=mask
to write N
characters to those parts of the read
that would otherwise have been removed.
Use --action=lowercase
to change to lowercase those parts of the read that
would otherwise have been removed. The rest is converted to uppercase.
New in version 3.1: The retain
action.
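The different actions can be compared side by side for a single 3’ adapter match. The sketch below assumes the match covers `read[start:end]` and only models the 3’ adapter case; the function is hypothetical and exists just to make the descriptions above concrete.

```python
def apply_action(read: str, start: int, end: int, action: str = "trim") -> str:
    """Apply an --action style choice to a 3' adapter match located at read[start:end]."""
    if action == "trim":
        # Remove the adapter and everything after it.
        return read[:start]
    if action == "retain":
        # Keep the adapter, remove only what follows it.
        return read[:end]
    if action == "mask":
        # Overwrite what would have been removed with N characters.
        return read[:start] + "N" * (len(read) - start)
    if action == "lowercase":
        # Lowercase what would have been removed, uppercase the rest.
        return read[:start].upper() + read[start:].lower()
    if action == "none":
        return read
    raise ValueError(f"unknown action: {action}")


read = "ACGTADAPTERGG"  # ADAPTER occupies positions 4 to 10
for action in ("trim", "retain", "mask", "lowercase", "none"):
    print(action, apply_action(read, 4, 11, action))
```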
Removing a fixed number of bases¶
By using the --cut
option or its abbreviation -u
, it is possible to
unconditionally remove bases from the beginning or end of each read. If
the given length is positive, the bases are removed from the beginning
of each read. If it is negative, the bases are removed from the end.
For example, to remove the first five bases of each read:
cutadapt -u 5 -o trimmed.fastq reads.fastq
To remove the last seven bases of each read:
cutadapt -u -7 -o trimmed.fastq reads.fastq
The -u
/--cut
option can be combined with the other options, but
the --cut
is applied before any adapter trimming.
Quality trimming¶
The -q
(or --quality-cutoff
) parameter can be used to trim
low-quality ends from reads. If you specify a single cutoff value, the
3’ end of each read is trimmed:
cutadapt -q 10 -o output.fastq input.fastq
For Illumina reads, this is sufficient as their quality is high at the beginning, but degrades towards the 3’ end.
It is also possible to trim from the 5’ end by specifying two comma-separated cutoffs as 5’ cutoff,3’ cutoff. For example,
cutadapt -q 15,10 -o output.fastq input.fastq
will quality-trim the 5’ end with a cutoff of 15 and the 3’ end with a cutoff
of 10. To only trim the 5’ end, use a cutoff of 0 for the 3’ end, as in
-q 15,0
.
Quality trimming is done before any adapter trimming.
For paired-end data, quality trimming is by default applied to both reads using
the same cutoff(s). Use option -Q
to specify different cutoffs for R2:
cutadapt -q 5 -Q 15,20 -o out.1.fastq -p out.2.fastq in.1.fastq in.2.fastq
To disable quality-trimming of R2, use -Q 0
.
By default, quality values are assumed to be encoded as
ascii(phred quality + 33). Nowadays, this should always be the case.
Some old Illumina FASTQ files encode qualities as ascii(phred quality + 64).
For those, you must add --quality-base=64
to the command line.
A description of the quality-trimming algorithm is also available. The algorithm is the same as used by BWA.
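For orientation, the BWA-style approach is commonly described like this: subtract the cutoff from every quality value, compute partial sums starting from the 3’ end, and cut at the position where the sum is minimal. The following is a minimal sketch of that idea for the 3’ end only, written for this section under that description; function names are invented and details may differ from Cutadapt's implementation.

```python
def phred_qualities(quality_string: str, base: int = 33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - base for char in quality_string]


def quality_trim_index_3prime(qualities, cutoff: int) -> int:
    """Return the length to which the read should be shortened at its 3' end."""
    running = 0
    best = 0
    stop = len(qualities)
    for i in reversed(range(len(qualities))):
        # Positive contributions come from bases below the cutoff.
        running += cutoff - qualities[i]
        if running < 0:
            break
        if running > best:
            best = running
            stop = i
    return stop


quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
print(quality_trim_index_3prime(quals, 10))
```

Note that the base with quality 11 is removed although it is above the cutoff, because it is surrounded by low-quality bases.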
Quality trimming of reads using two-color chemistry (NextSeq)¶
Some Illumina instruments use a two-color chemistry to encode the four bases.
This includes the NextSeq and the NovaSeq. In those instruments, a
‘dark cycle’ (with no detected color)
encodes a G
. However, dark cycles also occur when sequencing “falls
off” the end of the fragment. The read then contains a run of high-quality, but
incorrect “G” calls
at its 3’ end.
Since the regular quality-trimming algorithm cannot deal with this situation,
you need to use the --nextseq-trim
option:
cutadapt --nextseq-trim=20 -o out.fastq input.fastq
This works like regular quality trimming (where one would use -q 20
instead), except that the qualities of G
bases are ignored.
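One plausible way to read "the qualities of G bases are ignored" is to treat every G as if its quality were just below the cutoff, so that a trailing run of G bases is always trimmable. The sketch below makes exactly that assumption on top of a BWA-style partial-sum trimming loop; it is an illustration, and Cutadapt's own implementation may differ in detail.

```python
def nextseq_trim_index(sequence: str, qualities, cutoff: int) -> int:
    """Return the length to keep, treating G bases as if they had low quality."""
    running = 0
    best = 0
    stop = len(sequence)
    for i in reversed(range(len(sequence))):
        quality = qualities[i]
        if sequence[i] == "G":
            # Assumption: a G counts as one quality unit below the cutoff.
            quality = cutoff - 1
        running += cutoff - quality
        if running < 0:
            break
        if running > best:
            best = running
            stop = i
    return stop


seq = "ACGTGGGG"
quals = [30, 30, 30, 30, 40, 40, 40, 40]
print(seq[:nextseq_trim_index(seq, quals, 20)])
```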
New in version 1.10.
Shortening reads to a fixed length¶
To shorten each read down to a certain length, use the --length
option or
the short version -l
:
cutadapt -l 10 -o output.fastq.gz input.fastq.gz
This shortens all reads from input.fastq.gz
down to 10 bases. The removed bases
are those on the 3’ end.
If you want to remove a fixed number of bases from each read, use the --cut option instead.
Modifying read names¶
If you feel the need to modify the names of processed reads, some of the following options may be useful.
These options exist; they are explained in more detail in the following sections:
- --rename changes a read name according to a template.
- --prefix (or -x) adds a prefix to read names.
- --suffix (or -y) adds a suffix to read names.
- --length-tag updates a “length tag” such as length= with the correct read length.
- --strip-suffix removes a known suffix from read names.
The --prefix
and --suffix
options are outdated as they do not ensure that paired-end
read names remain consistent, and you should prefer to use --rename
.
--prefix
and --suffix
can currently not be used together with --rename
.
--rename
renames reads¶
The --rename
option can be used to rename both single-end and paired-end reads.
This section describes how it can be used to rename single-end reads.
We use the following terminology: The FASTQ or FASTA header line consists of a read ID and is optionally followed by a separator (whitespace) and a comment.
For example, in this FASTQ header, the read ID is read1234
and the comment is value=17
(sequence and qualities not shown):
@read1234 value=17
The --rename
option expects a template string such as
{id} extra_info {adapter_name}
as a parameter. It can contain regular text
and placeholders that consist of a name enclosed in curly braces ({placeholdername}
).
The read name will be set to the template string in which the placeholders are replaced with the actual values relevant for the current read.
The following placeholders are currently available for single-end reads:
- {header} – the full, unchanged header
- {id} – the read ID, that is, the part of the header before the first whitespace
- {comment} – the part of the header after the whitespace following the ID
- {adapter_name} – the name of the adapter that was found in this read, or no_adapter if there was no adapter match. If you use --times to do multiple rounds of adapter matching, this is the name of the last found adapter.
- {match_sequence} – the sequence of the read that matched the adapter (including errors). If there was no adapter match, this is set to an empty string. If you use a linked adapter, this is set to the two matching strings, separated by a comma.
- {cut_prefix} – the prefix removed by the --cut (or -u) option (that is, when used with a positive length argument)
- {cut_suffix} – the suffix removed by the --cut (or -u) option (that is, when used with a negative length argument)
- {rc} – this is replaced with the string rc if the read was reverse complemented. This only applies when reverse complementing was requested.
For example, assume you have this input read in in.fasta
:
>myread extra info
ACGTAAAATTTTCCCC
Running the command
cutadapt -a myadapter=TTTT -u 4 --rename='{id} barcode={cut_prefix} adapter={adapter_name} {comment}' in.fasta
will result in this modified read:
>myread barcode=ACGT adapter=myadapter extra info
AAAA
New in version 3.2: The {rn}
placeholder.
New in version 3.3: The {rc}
placeholder.
New in version 3.6: The {match_sequence}
placeholder.
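The template mechanism behaves much like Python's own `str.format`. The sketch below mimics it for the example above; it is not Cutadapt's code, and the values for `cut_prefix` and `adapter_name` are passed in by hand instead of being computed from the read.

```python
def rename(header: str, template: str, **values) -> str:
    """Fill a --rename style template for one read."""
    # The read ID is everything before the first whitespace, the rest is the comment.
    parts = header.split(maxsplit=1)
    read_id = parts[0]
    comment = parts[1] if len(parts) > 1 else ""
    return template.format(header=header, id=read_id, comment=comment, **values)


new_name = rename(
    "myread extra info",
    "{id} barcode={cut_prefix} adapter={adapter_name} {comment}",
    cut_prefix="ACGT",
    adapter_name="myadapter",
)
print(new_name)
```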
--rename
also renames paired-end reads¶
If the --rename
option is used with paired-end data, the template is applied
separately to both R1 and R2. That is, for R1, the placeholders are replaced with values
from R1, and for R2, the placeholders are replaced with values from R2. For example,
{comment}
becomes R1’s comment in R1 and it becomes R2’s comment in R2.
As another example, using --rename='{id} please note: {comment}'
, the paired-end reads
>myread important comment
...
>myread also quite important
...
are renamed to
>myread please note: important comment
...
>myread please note: also quite important
...
For paired-end data, the placeholder {rn}
is available (“read number”),
and it is replaced with 1
in R1 and with 2
in R2.
In addition, it is possible to write a placeholder as {r1.placeholdername}
or
{r2.placeholdername}
, which always takes the replacement value from R1 or R2,
respectively.
For example, assume R1 starts with a 4 nt barcode that you want to “move” from the
sequence into the ID of both reads. You can use
--cut=4 --rename='{id}_{r1.cut_prefix} {comment}'
and the read pair
>myread this is R1
ACGTAAAATTTT
>myread this is R2
GGGGCCCC
will be changed to
>myread_ACGT this is R1
AAAATTTT
>myread_ACGT this is R2
GGGGCCCC
The {r1.placeholder}
and {r2.placeholder}
notation is available for all
placeholders except {rn}
and {id}
because the read ID needs to be
identical for both reads.
In general, the read IDs of R1 and R2 need to be identical. Cutadapt
enforces this when reading paired-end FASTQ files, except that it allows a single trailing
“1” or “2” as the only difference between the read IDs. This allows for read IDs ending in
/1
and /2
(some old formats are like this) or .1
and .2
(fastq-dump
produces this).
If you use --rename
, Cutadapt will also enforce this when writing paired-end reads.
New in version 3.2: The --rename
option
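Two ideas from this section can be sketched in Python: placeholders prefixed with `r1.` or `r2.` always take their value from that read, and read IDs of a pair may differ only in a trailing "1" versus "2". Both functions are illustrative stand-ins with invented names; in particular, the ID comparison is a simplified guess at the rule stated above, not the exact check Cutadapt performs.

```python
from types import SimpleNamespace


def rename_pair(template: str, r1: dict, r2: dict):
    """Apply one template to both reads; r1/r2 hold id, comment and cut_prefix."""
    names = []
    for number, own in ((1, r1), (2, r2)):
        names.append(
            template.format(
                rn=number,
                r1=SimpleNamespace(**r1),  # enables {r1.cut_prefix} and similar
                r2=SimpleNamespace(**r2),
                **own,
            )
        )
    return tuple(names)


def ids_match(id1: str, id2: str) -> bool:
    """Accept identical IDs, or IDs differing only in a trailing 1 (R1) and 2 (R2)."""
    if id1 == id2:
        return True
    return (
        len(id1) == len(id2)
        and id1[:-1] == id2[:-1]
        and id1[-1] == "1"
        and id2[-1] == "2"
    )


r1 = {"id": "myread", "comment": "this is R1", "cut_prefix": "ACGT"}
r2 = {"id": "myread", "comment": "this is R2", "cut_prefix": ""}
print(rename_pair("{id}_{r1.cut_prefix} {comment}", r1, r2))
print(ids_match("read/1", "read/2"))
```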
Other read name modification¶
Use -y
(or its alias --suffix
) to append a text to read names. The given string can
contain the placeholder {name}
, which will be replaced with the name of the
adapter found in that read. For example, writing
cutadapt -a adapter1=ACGT -y ' we found {name}' input.fastq
changes a read named read1
to read1 we found adapter1
if the adapter
ACGT
was found.
The option -x (and its alias --prefix) works the same, except that the text
is added in front of the read name. For both options, spaces need to be
specified explicitly, as in the above example. If no adapter was found in a
read, the text no_adapter
is inserted for {name}
.
We recommend that you no longer use the -x
/--prefix
/-y
/--suffix
options and use --rename
instead, which is more general.
In order to remove a suffix of each read name, use --strip-suffix
.
Some old 454 read files contain the length of the read in the name:
>read1 length=17
ACGTACGTACAAAAAAA
If you want to update this to the correct length after trimming, use the option
--length-tag
. In this example, this would be --length-tag 'length='
.
After trimming, the read would perhaps look like this:
>read1 length=10
ACGTACGTAC
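What updating a length tag amounts to can be shown with a regular expression substitution. This is a sketch only; the function name is hypothetical and it assumes the tag is directly followed by the digits of the old length.

```python
import re


def update_length_tag(name: str, tag: str, new_length: int) -> str:
    """Replace the number following `tag` in a read name with the new read length."""
    pattern = re.escape(tag) + r"\d+"
    return re.sub(pattern, lambda m: f"{tag}{new_length}", name)


trimmed_sequence = "ACGTACGTAC"
print(update_length_tag("read1 length=17", "length=", len(trimmed_sequence)))
```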
Read modification order¶
The read modifications described above are applied in the following order to each read. Steps not requested on the command-line are skipped.
- Unconditional base removal with --cut
- Quality trimming (-q)
- Adapter trimming (-a, -b, -g and uppercase versions)
- Read shortening (--length)
- N-end trimming (--trim-n)
- Length tag modification (--length-tag)
- Read name suffix removal (--strip-suffix)
- Addition of prefix and suffix to read name (-x/--prefix and -y/--suffix)
- Read renaming according to --rename
- Replace negative quality values with zero (zero capping)
Filtering reads¶
By default, all processed reads, no matter whether they were trimmed or not,
are written to the output file specified by the -o
option (or to standard
output if -o
was not provided). For paired-end reads, the second read in a
pair is always written to the file specified by the -p
option.
The options described here make it possible to filter reads by either discarding them entirely or by redirecting them to other files. When redirecting reads, the basic rule is that each read is written to at most one file. You cannot write reads to more than one output file.
Filters are applied to all processed reads, no matter whether they have been modified by adapter- or quality trimming.
--minimum-length LENGTH or -m LENGTH
- Discard processed reads that are shorter than LENGTH. If you do not use this option, reads that have a length of zero (empty reads) are kept in the output. Some downstream tools may have problems with zero-length sequences. In that case, specify at least -m 1.
--too-short-output FILE
- Instead of discarding the reads that are too short according to -m, write them to FILE (in FASTA/FASTQ format).
--maximum-length LENGTH or -M LENGTH
- Discard processed reads that are longer than LENGTH.
--too-long-output FILE
- Instead of discarding reads that are too long (according to -M), write them to FILE (in FASTA/FASTQ format).
--untrimmed-output FILE
- Write all reads without adapters to FILE (in FASTA/FASTQ format) instead of writing them to the regular output file.
--discard-trimmed
- Discard reads in which an adapter was found.
--discard-untrimmed
- Discard reads in which no adapter was found. This has the same effect as specifying --untrimmed-output /dev/null.
The options --too-short-output
and --too-long-output
are applied first.
This means, for example, that a read that is too long will never end up in the
--untrimmed-output
file when --too-long-output
was given, no matter
whether it was trimmed or not.
The options --untrimmed-output, --discard-trimmed and --discard-untrimmed
are mutually exclusive.
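The rule that each read ends up in at most one place, with the length filters consulted first, can be sketched as a small decision function. The function, its parameters and the returned labels are all invented for this illustration, and the order of the checks beyond what is stated above is an assumption.

```python
def destination(length, has_adapter, *, min_length=None, max_length=None,
                too_short_output=False, too_long_output=False,
                untrimmed_output=False) -> str:
    """Decide where a single processed read goes (labels are illustrative)."""
    # Length-based filters come first.
    if min_length is not None and length < min_length:
        return "too-short-output" if too_short_output else "discarded"
    if max_length is not None and length > max_length:
        return "too-long-output" if too_long_output else "discarded"
    # Only then is the read considered for the untrimmed output.
    if untrimmed_output and not has_adapter:
        return "untrimmed-output"
    return "output"


# A too-long read without adapter goes to the too-long file, not the untrimmed one.
print(destination(200, False, max_length=100, too_long_output=True, untrimmed_output=True))
```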
The following filtering options do not have a corresponding option for redirecting reads. They always discard those reads for which the filtering criterion applies.
--max-n COUNT_or_FRACTION
- Discard reads with more than COUNT N bases. If COUNT_or_FRACTION is a number between 0 and 1, it is interpreted as a fraction of the read length.
--max-expected-errors ERRORS or --max-ee ERRORS
- Discard reads with more than ERRORS expected errors. The number of expected errors is computed as described in Edgar et al. (2015), (Section 2.2).
--discard-casava
- Discard reads that did not pass CASAVA filtering. Illumina’s CASAVA pipeline in version 1.8 adds an is_filtered header field to each read. If this option is specified, the reads that did not pass filtering (these are the reads that have a Y for is_filtered) are discarded. Reads for which the header cannot be recognized are kept.
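For the expected-errors filter, the usual definition is the sum of the per-base error probabilities implied by the Phred scores, that is, 10^(-Q/10) for each quality value Q. The sketch below computes this quantity; the function name is made up, and it is meant only to make the criterion tangible.

```python
def expected_errors(qualities) -> float:
    """Sum of error probabilities implied by a list of Phred quality values."""
    return sum(10 ** (-q / 10) for q in qualities)


# Ten bases at Q20 (1% error each) and ten bases at Q10 (10% error each).
print(expected_errors([20] * 10 + [10] * 10))
```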
Trimming paired-end reads¶
Cutadapt supports trimming of paired-end reads. To enable this, provide two
input files and a second output file with the -p
option (this is the short
form of --paired-output
). This is the basic command line syntax:
cutadapt -a ADAPTER_FWD -A ADAPTER_REV -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq
Here, the input reads are in reads.1.fastq
and reads.2.fastq
, and the
result will be written to out.1.fastq
and out.2.fastq
.
In paired-end mode, the options -a
, -b
, -g
and -u
that also
exist in single-end mode are applied to the forward reads only. To modify
the reverse read, these options have uppercase versions -A
, -B
,
-G
and -U
that work just like their counterparts.
In the example above, ADAPTER_FWD
will therefore be trimmed from the
forward reads and ADAPTER_REV
from the reverse reads.
Single-end/R1 option | Corresponding option for R2 |
---|---|
--adapter, -a | -A |
--front, -g | -G |
--anywhere, -b | -B |
--cut, -u | -U |
--output, -o | --paired-output, -p |
In paired-end mode, Cutadapt checks whether the input files are
properly paired. An error is raised if one of the files contains more reads than
the other or if the read names in the two files do not match. The read name
comparison ignores a trailing /1
or /2
to allow processing some old
Illumina paired-end files.
In some cases, it works to run Cutadapt twice in single-end mode on the input files, but we recommend against it as this skips the consistency checks that Cutadapt can do otherwise.
Also, as soon as you start to use one of the filtering options that discard reads, it is mandatory you process both files at the same time to make sure that the output files are kept synchronized. If a read is removed from one of the files, Cutadapt will always ensure that it is also removed from the other file.
The following command-line options are applied to both reads:
- -q (along with --quality-base)
- --times applies to all the adapters given
- --trim-n
- --action
- --length
- --length-tag
- --prefix, --suffix
The following limitations still exist:
- The --info-file, --rest-file and --wildcard-file options write out information only from the first read.
Filtering paired-end reads¶
The filtering options listed above can also be used when trimming paired-end data.
Importantly, Cutadapt always discards both reads of a pair if it determines that the pair should be discarded. This ensures that the reads in the output files are in sync. (If you don’t want or need this, you can run Cutadapt separately on the R1 and R2 files.)
The same applies also to the options that redirect reads to other files if they
fulfill a filtering criterion, such as
--too-short-output
/--too-short-paired-output
. That is, the reads are
always sent in pairs to these alternative output files.
The --pair-filter
option determines how to combine the filters for
R1 and R2 into a single decision about the read pair.
The default is --pair-filter=any
, which means that a read pair is discarded
(or redirected) if at least one of the reads (R1 or R2) fulfills the filtering criterion.
As an example, if option --minimum-length=20
is used and paired-end data is
processed, a read pair is discarded if at least one of the reads is shorter than
20 nt.
With --pair-filter=both
, you can require that filtering criteria must apply
to both reads in order for a read pair to be discarded.
Finally, --pair-filter=first
will make a decision about the read pair
by inspecting whether the filtering criterion applies to the first read,
ignoring the second read.
The following table describes the effect for some filtering options.
Filtering option | With --pair-filter=any, the pair is discarded if … | With --pair-filter=both, the pair is discarded if … |
---|---|---|
--minimum-length | one of the reads is too short | both reads are too short |
--maximum-length | one of the reads is too long | both reads are too long |
--discard-trimmed | one of the reads contains an adapter | both reads contain an adapter |
--discard-untrimmed | one of the reads does not contain an adapter | both reads do not contain an adapter |
--max-n | one of the reads contains too many N bases | both reads contain too many N bases |
There is currently no way to change the pair-filter mode for each filter individually.
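The three pair-filter modes boil down to how two per-read decisions are combined. The following sketch expresses that; `discard_pair` is a hypothetical helper in which `r1_fails` and `r2_fails` say whether the filtering criterion applies to R1 and R2.

```python
def discard_pair(r1_fails: bool, r2_fails: bool, pair_filter: str = "any") -> bool:
    """Combine the per-read filter results into one decision for the pair."""
    if pair_filter == "any":
        return r1_fails or r2_fails
    if pair_filter == "both":
        return r1_fails and r2_fails
    if pair_filter == "first":
        return r1_fails
    raise ValueError(f"unknown pair filter mode: {pair_filter}")


# With --minimum-length=20: R1 has 35 nt, R2 has 12 nt.
r1_too_short, r2_too_short = 35 < 20, 12 < 20
for mode in ("any", "both", "first"):
    print(mode, discard_pair(r1_too_short, r2_too_short, mode))
```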
Note
As an exception, when you specify adapters only for R1 (-a
/-g
/-b
) or only for
R2 (-A
/-G
/-B
), then the --pair-filter
mode for --discard-untrimmed
is
forced to be both
(and accordingly, also for the --untrimmed-(paired-)output
options).
Otherwise, with the default --pair-filter=any
setting, all pairs would be considered
untrimmed because it would always be the case that one of the reads in the pair does not contain
an adapter.
The pair-filter mode for the other filtering options, such as --minimum-length
, is
not overridden in the same way and remains any
unless changed explicitly with the
--pair-filter
option.
These are the paired-end specific filtering and output options:
--minimum-length LENGTH1:LENGTH2 or -m LENGTH1:LENGTH2
- When trimming paired-end reads, the minimum lengths for R1 and R2 can be specified separately by separating them with a colon (:). If the colon syntax is not used, the same minimum length applies to both reads, as discussed above. Also, one of the values can be omitted to impose no restrictions. For example, with -m 17:, the length of R1 must be at least 17, but the length of R2 is ignored.
--maximum-length LENGTH1:LENGTH2 or -M LENGTH1:LENGTH2
- Maximum lengths can also be specified separately, see the explanation of -m above.
--paired-output FILE or -p FILE
- Write the second read of each processed pair to FILE (in FASTA/FASTQ format).
--untrimmed-paired-output FILE
- Used together with --untrimmed-output. The second read in a pair is written to this file when the processed pair was not trimmed.
--too-short-paired-output FILE
- Write the second read in a pair to this file if the pair is too short. Use together with --too-short-output.
--too-long-paired-output FILE
- Write the second read in a pair to this file if the pair is too long. Use together with --too-long-output.
--pair-filter=(any|both|first)
- Which of the reads in a paired-end read have to match the filtering criterion in order for it to be filtered.
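The colon syntax for per-read lengths can be read as a pair of optional integers. The parser below is a sketch of that interpretation, using `None` for an omitted value; it is not taken from Cutadapt and the function name is made up.

```python
def parse_length_pair(value: str):
    """Parse 'N', 'N1:N2', 'N1:' or ':N2' into a (R1 limit, R2 limit) tuple."""
    if ":" not in value:
        # Without a colon, the same limit applies to both reads.
        limit = int(value)
        return limit, limit
    first, second = value.split(":", 1)
    return (int(first) if first else None, int(second) if second else None)


print(parse_length_pair("17:"))
print(parse_length_pair("20"))
```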
Note that the option names can be abbreviated as long as it is clear which
option is meant (unique prefix). For example, instead of --untrimmed-output
and --untrimmed-paired-output
, you can write --untrimmed-o
and
--untrimmed-p
.
New in version 1.18: --pair-filter=first
Paired adapters (dual indices)¶
When processing paired-end data, Cutadapt has two sets of adapters to work with: The ones that
are to be found and removed in the forward read (R1), specified with -a
/-g
/-b
,
and the ones to be found and removed in the reverse read (R2), specified with -A
/-G
/-B
.
Normally, the program looks at the R1 and R2 reads independently. That is, the best matching R1 adapter is removed from R1 and the best matching R2 adapter is removed from R2.
To change this, the option --pair-adapters
can be used. It causes each R1 adapter to be
paired up with its corresponding R2 adapters. The first R1 adapter will be paired up with the first
R2 adapter, and so on. The adapters are then always removed in pairs from a read pair. It is an
error if the number of provided adapters is not identical for the R1 and R2 sets.
This option was added to aid in demultiplexing Illumina libraries that contain unique dual indexes (UDI). This scheme, also called “non-redundant indexing”, uses 96 unique i5 indices and 96 unique i7 indices, which are only used in pairs, that is, the first i5 index is always used with the first i7 index and so on.
Note
If the adapters do not come in pairs, but all combinations are possible, see the section about combinatorial demultiplexing.
An example:
cutadapt --pair-adapters -a AAAAA -a GGGG -A CCCCC -A TTTT -o out.1.fastq -p out.2.fastq in.1.fastq in.2.fastq
Here, the adapter pairs are (AAAAA, CCCCC) and (GGGG, TTTT). That is, paired-end
reads will only be trimmed if either
- AAAAA is found in R1 and CCCCC is found in R2,
- or GGGG is found in R1 and TTTT is found in R2.
The --pair-adapters
option can be used also when demultiplexing.
There is one limitation of the algorithm at the moment: The program looks for the best-matching R1 adapter first and then checks whether the corresponding R2 adapter can be found. If not, the read pair remains unchanged. However, it is in theory possible that a different R1 adapter that does not fit as well would have a partner that can be found. Some read pairs may therefore remain untrimmed.
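The pairing logic and its limitation can be sketched as follows. This is a heavily simplified illustration: it uses exact substring search instead of error-tolerant alignment, treats all adapters as 3’ adapters, and ranks R1 adapters by length. The function name is hypothetical.

```python
def trim_pair(r1: str, r2: str, r1_adapters, r2_adapters):
    """Trim a read pair only if the best R1 adapter and its R2 partner are both found."""
    if len(r1_adapters) != len(r2_adapters):
        raise ValueError("the number of R1 and R2 adapters must be identical")
    # Simplification: the "best" R1 adapter is the longest one that occurs exactly.
    candidates = [(len(a), i) for i, a in enumerate(r1_adapters) if a in r1]
    if not candidates:
        return r1, r2
    _, best = max(candidates)
    partner = r2_adapters[best]
    if partner not in r2:
        # The limitation described above: other R1 adapters are not tried.
        return r1, r2
    return r1[: r1.index(r1_adapters[best])], r2[: r2.index(partner)]


print(trim_pair("CTCTAAAAA", "GAGACCCCC", ["AAAAA", "GGGG"], ["CCCCC", "TTTT"]))
print(trim_pair("CTCTAAAAA", "GAGATTTT", ["AAAAA", "GGGG"], ["CCCCC", "TTTT"]))
```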
New in version 2.1.
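The limitation described above can be illustrated with a minimal Python sketch. It assumes exact substring matching and treats the longest adapter hit as the "best" match, which is a strong simplification of Cutadapt's error-tolerant alignment. The function name `trim_pair` is hypothetical and not part of Cutadapt.

```python
def trim_pair(r1, r2, pairs):
    """Trim a read pair using paired adapters (simplified: exact matching only).

    pairs is a list of (r1_adapter, r2_adapter) tuples in command-line order.
    """
    # Step 1: pick the best-matching R1 adapter (here: the longest exact hit,
    # the earlier adapter wins ties).
    best = None
    for a1, a2 in pairs:
        if a1 in r1 and (best is None or len(a1) > len(best[0])):
            best = (a1, a2)
    if best is None:
        return r1, r2
    a1, a2 = best
    # Step 2: trim only if the partner adapter is found in R2.
    if a2 not in r2:
        return r1, r2  # other R1 adapters are not reconsidered
    return r1[: r1.index(a1)], r2[: r2.index(a2)]


pairs = [("AAAAA", "CCCCC"), ("GGGG", "TTTT")]
print(trim_pair("ACGTAAAAAGG", "TGCACCCCCTT", pairs))   # both adapters of pair 1 found
print(trim_pair("ACGTAAAAAGGGG", "TGCATTTT", pairs))    # pair 2 would fit, but is never tried
```

In the second call, AAAAA is selected first, its partner CCCCC is missing from R2, and the pair stays untrimmed although (GGGG, TTTT) would have matched.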
Interleaved paired-end reads¶
Cutadapt supports reading and writing paired-end reads from a single FASTQ file
in which the entries for the first and second read from each pair alternate.
The first read in each pair comes before the second. This is called “interleaved”
format. Enable this file format by adding the --interleaved
option to the
command-line. Then, if you provide only a single file where usually two would be
expected, reads are automatically read or written interleaved.
For example, to read interleaved from reads.fastq
and to write interleaved to trimmed.fastq
:
cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.fastq reads.fastq
In the following example, the input reads.fastq
is interleaved, but output is
written to two files trimmed.1.fastq
and trimmed.2.fastq
:
cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq -p trimmed.2.fastq reads.fastq
Reading two-file input and writing interleaved is also possible by providing a second input file:
cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq reads.1.fastq reads.2.fastq
The following options also support interleaved output:
- --untrimmed-output (omit --untrimmed-paired-output)
- --too-short-output (omit --too-short-paired-output)
- --too-long-output (omit --too-long-paired-output)
If you omit --interleaved
but trim paired-end files, the above options must be used in pairs.
Cutadapt will detect if an input file is not properly interleaved by checking whether read names match and whether the file contains an even number of entries.
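To make the two checks concrete, here is a small Python sketch of an interleaved reader. It is not Cutadapt's implementation: it assumes plain four-line FASTQ records, loads everything into memory, and compares names after removing a trailing `/1` or `/2`. The function names are hypothetical.

```python
import io


def base_name(name):
    """Strip a trailing /1 or /2 so that both mates compare equal."""
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2]
    return name


def read_interleaved(handle):
    """Return a list of (r1, r2) pairs; each read is (name, sequence, qualities)."""
    records = []
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qualities = handle.readline().rstrip("\n")
        name = header[1:].split()[0]  # drop '@' and any comment after the name
        records.append((name, sequence, qualities))
    # Check 1: an interleaved file must contain an even number of entries.
    if len(records) % 2 != 0:
        raise ValueError("odd number of records, file is not properly interleaved")
    pairs = []
    for r1, r2 in zip(records[0::2], records[1::2]):
        # Check 2: the names of both reads in a pair must match.
        if base_name(r1[0]) != base_name(r2[0]):
            raise ValueError(f"read names do not match: {r1[0]!r} and {r2[0]!r}")
        pairs.append((r1, r2))
    return pairs


data = "@read1/1\nACGT\n+\nIIII\n@read1/2\nTGCA\n+\nIIII\n"
print(read_interleaved(io.StringIO(data)))
```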
Trimming paired-end reads separately¶
Warning
Trimming paired-end data in this way is not recommended as it
bypasses all paired-end error-checking, such as checking whether
the number of reads is the same in both files. You should use
the normal paired-end trimming mode with the -o
/-p
options described above.
If you do not use any of the filtering options that discard reads, such
as --discard
, --minimum-length
or --maximum-length
, you can run
Cutadapt on each file separately:
cutadapt -a ADAPTER_FWD -o trimmed.1.fastq.gz reads1.fastq.gz
cutadapt -a ADAPTER_REV -o trimmed.2.fastq.gz reads2.fastq.gz
You can use the options that are listed under ‘Additional modifications’ in Cutadapt’s help output without problems. For example, if you want to quality-trim the first read in each pair with a threshold of 10, and the second read in each pair with a threshold of 15, then the commands could be:
cutadapt -q 10 -a ADAPTER_FWD -o trimmed.1.fastq reads1.fastq
cutadapt -q 15 -a ADAPTER_REV -o trimmed.2.fastq reads2.fastq
Note
Previous Cutadapt versions (up to 1.18) had a “legacy mode” that was
activated under certain conditions and in which the read-modifying
options such as -q
would only apply to the forward/R1 reads.
This mode no longer exists.
Multiple adapters¶
It is possible to specify more than one adapter sequence by using the options
-a
, -b
and -g
more than once. Any combination is allowed, such as
five -a
adapters and two -g
adapters. Each read will be searched for
all given adapters, but only the best matching adapter is removed. (But it
is possible to trim more than one adapter from each
read). This is how a command may look to trim one of two
possible 3’ adapters:
cutadapt -a TGAGACACGCA -a AGGCACACAGGG -o output.fastq input.fastq
The adapter sequences can also be read from a FASTA file. Instead of giving an
explicit adapter sequence, you need to write file:
followed by the name of
the FASTA file:
cutadapt -a file:adapters.fasta -o output.fastq input.fastq
All of the sequences in the file adapters.fasta
will be used as 3’
adapters. The other adapter options -b
and -g
also support this.
The file:
syntax can be combined with the regular way of specifying an
adapter. But no matter how you specify multiple adapter sequences, remember
that only the best matching adapter is trimmed from each read.
When Cutadapt has multiple adapter sequences to work with, either specified explicitly on the command line or via a FASTA file, it decides in the following way which adapter should be trimmed:
- All given adapter sequences are matched to the read.
- Adapter matches where the overlap length (see the -O parameter) is too small or where the error rate is too high (-e) are removed from further consideration.
- Among the remaining matches, the one with the greatest number of matching bases is chosen.
- If there is a tie, the first adapter wins. The order of adapters is the order in which they are given on the command line or in which they are found in the FASTA file.
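The selection steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not Cutadapt's code: each candidate match is reduced to an overlap length and an error count, the number of matching bases is approximated as overlap minus errors, and the error limit is taken as the error rate times the overlap length, rounded down. The names `Match` and `choose_adapter` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    adapter: str  # adapter name
    overlap: int  # number of adapter bases aligned to the read
    errors: int   # number of errors in that alignment


def choose_adapter(matches, min_overlap=3, max_error_rate=0.1):
    """Return the winning match or None. `matches` must be in adapter order."""
    best = None
    for m in matches:
        if m.overlap < min_overlap:
            continue  # overlap too small (-O)
        if m.errors > int(max_error_rate * m.overlap):
            continue  # error rate too high (-e)
        matching = m.overlap - m.errors
        # Strict '>' keeps the earlier adapter when there is a tie.
        if best is None or matching > best.overlap - best.errors:
            best = m
    return best


candidates = [
    Match("first", overlap=10, errors=1),
    Match("second", overlap=10, errors=1),  # ties with "first"
    Match("short", overlap=2, errors=0),    # overlap too small
    Match("noisy", overlap=12, errors=3),   # too many errors
]
print(choose_adapter(candidates).adapter)
```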
If your adapter sequences are all similar and differ only by a variable barcode sequence, you can use a single adapter sequence instead that contains wildcard characters.
If you want to search for a combination of a 5’ and a 3’ adapter, you may want to provide them as a single so-called “linked adapter” instead.
Named adapters¶
Cutadapt reports statistics for each adapter separately. To identify the adapters, they are numbered and the adapter sequence is also printed:
=== Adapter 1 ===
Sequence: AACCGGTT; Length 8; Trimmed: 5 times.
If you want this to look a bit nicer, you can give each adapter a name in this way:
cutadapt -a My_Adapter=AACCGGTT -o output.fastq input.fastq
The actual adapter sequence in this example is AACCGGTT
and the name
assigned to it is My_Adapter
. The report will then contain this name in
addition to the other information:
=== Adapter 'My_Adapter' ===
Sequence: AACCGGTT; Length 8; Trimmed: 5 times.
When adapters are read from a FASTA file, the sequence header is used as the adapter name.
Adapter names are also used in column 8 of info files.
Trimming more than one adapter from each read¶
By default, at most one adapter sequence is removed from each read, even if
multiple adapter sequences were provided. This can be changed by using the
--times
option (or its abbreviated form -n
). Cutadapt will then search
for all the given adapter sequences repeatedly, either until no adapter match
was found or until the specified number of rounds was reached.
As an example, assume you have a protocol in which a 5’ adapter gets ligated to your DNA fragment, but it’s possible that the adapter is ligated more than once. So your sequence could look like this:
ADAPTERADAPTERADAPTERmysequence
To be on the safe side, you assume that there are at most five copies of the adapter sequence. This command can be used to trim the reads correctly:
cutadapt -g ^ADAPTER -n 5 -o output.fastq.gz input.fastq.gz
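What the repeated search does to such a read can be sketched in a few lines of Python. The sketch only handles one anchored 5' adapter with exact matching, and the function name `trim_repeated` is hypothetical.

```python
def trim_repeated(read, adapter, times):
    """Remove an anchored 5' adapter up to `times` times (exact matching only)."""
    for _ in range(times):
        if not read.startswith(adapter):
            break  # no adapter match in this round, stop early
        read = read[len(adapter):]
    return read


read = "ADAPTERADAPTERADAPTERmysequence"
print(trim_repeated(read, "ADAPTER", 1))  # default behaviour: one round
print(trim_repeated(read, "ADAPTER", 5))  # corresponds to -n 5
```

With one round, two adapter copies remain in the read; with up to five rounds, the search stops after the third round because no further match is found.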
To search for a combination of a 5’ and a 3’ adapter, have a look at the support for “linked adapters” instead, which works better for that particular case because it allows you to require that the 3’ adapter is trimmed only when the 5’ adapter also occurs, and it cannot happen that the same adapter is trimmed twice.
Before Cutadapt supported linked adapters, the --times
option was the
recommended way to search for 5’/3’ linked adapters. For completeness, we
describe how it was done. For example, when the 5’ adapter is FIRST and the
3’ adapter is SECOND, then the read could look like this:
FIRSTmysequenceSECOND
That is, the sequence of interest is framed by the 5’ and the 3’ adapter. The following command would be used to trim such a read:
cutadapt -g ^FIRST -a SECOND -n 2 ...
Demultiplexing¶
Cutadapt supports demultiplexing, which means that reads are written to different
output files depending on which adapter was found in them. To use this, include
the string {name}
in the name of the output file and give each adapter
a name.
The path is then interpreted as a template and each trimmed read is written
to the path in which {name}
is replaced with the name of the adapter that
was found in the read. Reads in which no adapter was found will be written to a
file in which {name}
is replaced with unknown
.
Example:
cutadapt -a one=TATA -a two=GCGC -o trimmed-{name}.fastq.gz input.fastq.gz
This command will create the three files trimmed-one.fastq.gz,
trimmed-two.fastq.gz and trimmed-unknown.fastq.gz.
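The template substitution itself is simple. The following Python sketch shows the idea for the example above; the function name `output_path` is hypothetical and the code is only an illustration of the naming rule, not Cutadapt's implementation.

```python
def output_path(template, adapter_name):
    """Fill in the {name} placeholder; None means that no adapter was found."""
    name = "unknown" if adapter_name is None else adapter_name
    return template.replace("{name}", name)


template = "trimmed-{name}.fastq.gz"
for found in ("one", "two", None):
    print(output_path(template, found))
```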
More realistically, your “adapters” would actually be barcode sequences that you
will want to provide in a FASTA file. Here is a
made-up example for such a barcodes.fasta
file:
>barcode01
^TTAAGGCC
>barcode02
^TAGCTAGC
>barcode03
^ATGATGAT
Our barcodes are located at the 5’ end of the R1 read, so we made sure to use
anchored 5’ adapters by prefixing
each sequence with the ^
character. We will then use -g file:barcodes.fasta
,
where the -g
option specifies that our adapters are 5’ adapters.
These barcode sequences have a length of 8, which means that Cutadapt
would not allow any errors when matching them: The default is to allow 10%
errors, but 10% of 8 is 0.8, which is rounded down to 0. To allow one
error, we increase the maximum error rate to 15% with -e 0.15
.
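The arithmetic behind this choice can be checked with a short Python sketch. It assumes, as described above, that the number of allowed errors is the error rate times the sequence length, rounded down. The function names are hypothetical.

```python
def max_errors(length, error_rate):
    """Allowed errors: error rate times length, rounded down."""
    return int(error_rate * length)


def min_rate_for(length, errors):
    """Smallest error rate that permits `errors` errors at this length."""
    return errors / length


print(max_errors(8, 0.1))   # default rate: 0.8 is rounded down
print(max_errors(8, 0.15))  # 1.2 is rounded down
print(min_rate_for(8, 1))   # any rate from this value up to (but not including) 0.25 allows one error
```

For 8 nt barcodes, any rate of at least 0.125 and below 0.25 allows exactly one error, so `-e 0.15` is one of several values that work.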
Finally, we also use --no-indels
because we don’t want to allow
insertions or deletions. Also, with the --no-indels
option, Cutadapt can
use a different algorithm and demultiplexing will be many times faster.
Here is the final command:
cutadapt -e 0.15 --no-indels -g file:barcodes.fasta -o "trimmed-{name}.fastq.gz" input.fastq.gz
Demultiplexing is also supported for paired-end data if you provide the {name}
template
in both output file names (-o
and -p
). Example:
cutadapt -e 0.15 --no-indels -g file:barcodes.fasta -o trimmed-{name}.1.fastq.gz -p trimmed-{name}.2.fastq.gz input.1.fastq.gz input.2.fastq.gz
Paired-end demultiplexing always uses the adapter matches of the first read to decide where a
read should be written. If adapters for read 2 are given (-A
/-G
), they are detected and
removed as normal, but these matches do not influence where the read pair is written. This is
to ensure that read 1 and read 2 are always synchronized.
To demultiplex using a barcode that is located on read 2, you can swap the roles of R1 and R2 for both the input and output files:
cutadapt -e 0.15 --no-indels -g file:barcodes.fasta -o trimmed-{name}.2.fastq.gz -p trimmed-{name}.1.fastq.gz input.2.fastq.gz input.1.fastq.gz
If you do this in a script or pipeline, it may be a good idea to add a comment to clarify that this reversal of R1 and R2 is intended.
More advice on demultiplexing:
- You can use --untrimmed-output to change the name of the output file that receives the untrimmed reads (those in which no barcode could be found).
- Similarly, you can use --untrimmed-paired-output to change the name of the output file that receives the untrimmed R2 reads.
- If you want to demultiplex, but keep the barcode in the reads, use the option --action=none.
Demultiplexing paired-end reads with combinatorial dual indexes¶
Illumina’s combinatorial dual indexing strategy uses a set of indexed adapters on R1 and another one on R2. Unlike unique dual indexes (UDI), all combinations of indexes are possible.
For demultiplexing this type of data (“combinatorial demultiplexing”), it is necessary to write each read pair to an output file depending on the adapters found on R1 and R2.
Doing this with Cutadapt is similar to doing normal demultiplexing as described above, but you need
to use {name1}
and {name2}
in both output file name templates. For example:
cutadapt \
-e 0.15 --no-indels \
-g file:barcodes_fwd.fasta \
-G file:barcodes_rev.fasta \
-o {name1}-{name2}.1.fastq.gz -p {name1}-{name2}.2.fastq.gz \
input.1.fastq.gz input.2.fastq.gz
The {name1}
will be replaced with the name of the best-matching R1 adapter and {name2}
will
be replaced with the name of the best-matching R2 adapter.
If there was no match of an R1 adapter, {name1}
is set to “unknown”, and if there is no match of
an R2 adapter, {name2}
is set to “unknown”. To discard read pairs for which one or both adapters
could not be found, use --discard-untrimmed
.
The --untrimmed-output
and --untrimmed-paired-output
options cannot be used.
Read the demultiplexing section for how to choose the error rate etc. Also, the tips below about how to speed up demultiplexing apply even with combinatorial demultiplexing.
When doing the above, you will end up with lots of files named first-second.x.fastq.gz
, where
first is the name of the first indexed adapter and second is the name of the second indexed
adapter, and x is 1 or 2. Each indexed adapter combination may correspond to a sample name and
you may want to name your files according to the sample name, not the name of the adapters.
Cutadapt does not have built-in functionality to achieve this, but you can use an external
tool such as mmv
(“multiple move”). First, create a list of patterns in patterns.txt
:
fwdindex1-revindex1.[12].fastq.gz sampleA.#1.fastq.gz
fwdindex1-revindex2.[12].fastq.gz sampleB.#1.fastq.gz
fwdindex1-revindex3.[12].fastq.gz sampleC.#1.fastq.gz
fwdindex2-revindex1.[12].fastq.gz sampleD.#1.fastq.gz
fwdindex2-revindex2.[12].fastq.gz sampleE.#1.fastq.gz
...
Here, fwdindex1/revindex1 etc. are the names of indexes, and sampleA etc. are your sample names. Then rename all files at once with
mmv < patterns.txt
New in version 2.4.
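If the mapping from index combinations to samples already exists in a script or sample sheet, the contents of patterns.txt can be generated instead of typed by hand. The following Python sketch assumes a hypothetical dictionary `samples` keyed by (forward index name, reverse index name); the index and sample names are the made-up ones from the example above.

```python
def mmv_patterns(samples):
    """Build one mmv pattern line per (forward index, reverse index) combination."""
    lines = []
    for (fwd, rev), sample in samples.items():
        # [12] matches the read number, #1 re-inserts it in the new name.
        lines.append(f"{fwd}-{rev}.[12].fastq.gz {sample}.#1.fastq.gz")
    return lines


samples = {
    ("fwdindex1", "revindex1"): "sampleA",
    ("fwdindex1", "revindex2"): "sampleB",
    ("fwdindex2", "revindex1"): "sampleD",
}
print("\n".join(mmv_patterns(samples)))
```

The printed lines can be redirected into patterns.txt and then passed to mmv as shown above.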
Speeding up demultiplexing¶
Finding many adapters/barcodes simultaneously (which is what demultiplexing in Cutadapt is about), can be sped up tremendously by using the right options since Cutadapt will then be able to create an index of the barcode sequences instead of checking for each barcode separately. Currently, the following conditions need to be met in order for index creation to be enabled:
- The barcodes/adapters must be anchored 5’ adapters (-g ^ADAPTER) or anchored 3’ adapters (-a ADAPTER$). If you use file: to read in the adapter sequences from a FASTA file, remember to add the ^ or $ to each sequence in the FASTA file.
- The maximum error rate (-e) must be set such that at most 2 errors are allowed, so use -e 0, -e 1 or -e 2.
- No IUPAC wildcards must be used in the barcode/adapter. Also, you cannot use the option --match-read-wildcards.
An index will be built for all the adapters that fulfill these criteria if there are at least two of them. You can provide additional adapters/barcodes, and they will just not be included in the index. Whether an index is created or not should not affect the results, only how fast you get them.
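To see why an index helps, consider the following Python sketch. It is one possible way to build such an index and makes no claim about how Cutadapt does it internally: every sequence within a given number of substitutions of a barcode is precomputed and stored in a dictionary, so that each read needs only a single lookup instead of one comparison per barcode. The sketch covers anchored 5' barcodes and substitutions only, and it simply drops sequences that are reachable from more than one barcode. All function names are hypothetical.

```python
from itertools import combinations, product


def build_index(barcodes, max_mismatches=1):
    """Map every sequence within max_mismatches substitutions to a barcode name."""
    index = {}
    ambiguous = set()
    for name, seq in barcodes.items():
        for k in range(max_mismatches + 1):
            for positions in combinations(range(len(seq)), k):
                for bases in product("ACGT", repeat=k):
                    variant = list(seq)
                    for pos, base in zip(positions, bases):
                        variant[pos] = base
                    variant = "".join(variant)
                    if index.get(variant, name) != name:
                        ambiguous.add(variant)  # reachable from two barcodes
                    index[variant] = name
    for variant in ambiguous:
        del index[variant]
    return index


def lookup(index, read, barcode_length):
    """Return the barcode name for the read prefix, or None if there is no match."""
    return index.get(read[:barcode_length])


barcodes = {"barcode01": "TTAAGGCC", "barcode02": "TAGCTAGC", "barcode03": "ATGATGAT"}
index = build_index(barcodes, max_mismatches=1)
print(lookup(index, "TTAAGGCAACGTACGT", 8))  # one mismatch in the last barcode base
```

The size of the dictionary grows quickly with the number of allowed errors, which is one plausible reason for limiting indexed matching to a small number of errors.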
To see whether an index is created, look for a message like this in the first few lines of Cutadapt’s output:
Building index of 23 adapters ...
Hopefully some of the above restrictions will be lifted in the future.
New in version 1.15: Demultiplexing of paired-end data.
New in version 2.0: Added ability to use an index of adapters for speeding up demultiplexing
New in a later version: An index can be built even when indels are allowed (that is, --no-indels
is no longer required).
Demultiplexing paired-end reads in mixed orientation¶
For some protocols, the barcode will be located either on R1 or on R2 depending on the orientation in which the DNA fragment was sequenced.
For example, the read layout could be either this
R1: barcode-forwardprimer-sequence R2: reverseprimer-sequence
or this
R1: reverseprimer-sequence R2: barcode-forwardprimer-sequence
To demultiplex such data with Cutadapt, choose one of the orientations first and demultiplex the reads as if only that orientation existed in the data, using a command like this:
cutadapt -g file:barcodes.fasta \
-o round1-{name}.R1.fastq.gz \
-p round1-{name}.R2.fastq.gz \
R1.fastq.gz R2.fastq.gz
Then all the read pairs in which no barcode could be found will end up in
round1-unknown.R1.fastq.gz
and round1-unknown.R2.fastq.gz
. This will
also include the pairs in which the barcode was not actually in R1, but in R2. To
demultiplex these reads as well, run Cutadapt a second time with those “unknown”
files as input, but also reverse the roles of R1 and R2:
cutadapt -g file:barcodes.fasta \
-o round2-{name}.R2.fastq.gz \
-p round2-{name}.R1.fastq.gz \
round1-unknown.R2.fastq.gz round1-unknown.R1.fastq.gz
Illumina TruSeq¶
Illumina makes their adapter sequences available in the Illumina Adapter Sequences Document.
As an example for how to use that information with Cutadapt, we show
how to trim TruSeq adapters. The document gives the adapter sequence
for read 1 as AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
and for read 2
as AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
. When using Cutadapt, this
means you should trim your paired-end data as follows:
cutadapt \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz \
reads.R1.fastq.gz reads.R2.fastq.gz
See also the section about paired-end adapter trimming above.
Keep in mind that Cutadapt removes the adapter that it finds and also the sequence following it, so even if the actual adapter sequence that is used in a protocol is longer than that (and possibly contains a variable index), it is sufficient to specify a prefix of the sequence(s).
Note
Previous versions of this document also recommended using AGATCGGAAGAGC
as adapter sequence for both read 1 and read 2, but you should avoid doing so
as that sequence occurs multiple times in the human genome.
Some older information is also available in the document Illumina TruSeq Adapters De-Mystified, but keep in mind that it does not cover newer protocols.
Under some circumstances, you may want to consider not trimming adapters at all. For example, a good library prepared for exome, genome or transcriptome sequencing should contain very few reads with adapters anyway. Also, some read mapping programs including BWA-MEM and STAR will soft-clip bases at the 3’ ends of reads that do not match the reference, which will take care of adapters implicitly.
Warning about incomplete adapter sequences¶
Sometimes Cutadapt’s report ends with these lines:
WARNING:
One or more of your adapter sequences may be incomplete.
Please see the detailed output above.
Further up, you’ll see a message like this:
Bases preceding removed adapters:
A: 95.5%
C: 1.0%
G: 1.6%
T: 1.6%
none/other: 0.3%
WARNING:
The adapter is preceded by "A" extremely often.
The provided adapter sequence may be incomplete.
To fix the problem, add "A" to the beginning of the adapter sequence.
This means that in 95.5% of the cases in which an adapter was removed from a
read, the base coming before that was an A
. If your DNA fragments are
not random, such as in amplicon sequencing, then this is to be expected and
the warning can be ignored. If the DNA fragments are supposed to be random,
then the message may be genuine: The adapter sequence may be incomplete and
should include an additional A
in the beginning.
This warning exists because some documents list the Illumina TruSeq adapters
as starting with GATCGGA...
. While that is technically correct, the
library preparation actually results in an additional A
before that
sequence, which also needs to be removed. See the previous
section for the correct sequence.
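The statistic behind this warning is a simple tally of the base that precedes each removed adapter. The Python sketch below reproduces the percentages from the example message. The 80% threshold used to decide whether one base is dominant is an assumption made for illustration, and the function names are hypothetical.

```python
from collections import Counter


def preceding_base_percentages(bases):
    """Percentage of each base seen directly before a removed adapter."""
    counts = Counter(bases)
    total = sum(counts.values())
    return {base: 100 * n / total for base, n in counts.items()}


def dominant_base(bases, threshold=80.0):
    """Return the base that precedes the adapter 'extremely often', else None.

    The threshold value is hypothetical.
    """
    if not bases:
        return None
    percentages = preceding_base_percentages(bases)
    base, percent = max(percentages.items(), key=lambda item: item[1])
    if base in ("A", "C", "G", "T") and percent > threshold:
        return base
    return None


# "" stands for "none/other", for example when the adapter starts at the first base.
observed = ["A"] * 955 + ["C"] * 10 + ["G"] * 16 + ["T"] * 16 + [""] * 3
print(preceding_base_percentages(observed)["A"])
print(dominant_base(observed))
```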
Dealing with N bases¶
Cutadapt supports the following options to deal with N bases in your reads:
--max-n COUNT
- Discard reads containing more than COUNT N bases. A fractional COUNT between 0 and 1 can also be given and will be treated as the proportion of maximally allowed N bases in the read. For example, --max-n 0 removes all reads that contain any N bases.
--trim-n
- Remove flanking N bases from each read. That is, a read such as NNACGTACGTNNNN is trimmed to just ACGTACGT. This option is applied after adapter trimming. If you want to get rid of N bases before adapter removal, use quality trimming: N bases typically also have a low quality value associated with them.
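Both options can be mimicked in a few lines of Python, which may help to predict their effect on a given read. This is a sketch with hypothetical function names. It treats values strictly between 0 and 1 as proportions, which is an assumption about the boundary cases, and it ignores the quality string, which would have to be trimmed along with the sequence.

```python
def trim_n(sequence):
    """Remove flanking N bases; N bases inside the read are kept."""
    return sequence.strip("N")


def too_many_n(sequence, max_n):
    """Return True if the read would be discarded for containing too many N bases."""
    n_count = sequence.count("N")
    if 0 < max_n < 1:
        # Fractional value: interpreted as a proportion of the read length.
        return n_count > max_n * len(sequence)
    return n_count > max_n


print(trim_n("NNACGTACGTNNNN"))
print(too_many_n("ACGNACGT", 0))      # any N is too many
print(too_many_n("ACGTNNNNNN", 0.5))  # 6 of 10 bases are N
```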
Cutadapt’s output¶
Reporting¶
Cutadapt will by default print a full report after it has finished processing
the reads. To suppress all output except error messages, use the option
--quiet
.
The report type can be changed to a one-line summary with the option
--report=minimal
. The output will be a tab-separated table (tsv) with one
header row and one row of content. Here is an example:
$ cutadapt --report=minimal -a ... -m 20 -q 10 -o ... -p ... in.[12].fastq.gz
status in_reads in_bp too_short too_long too_many_n out_reads w/adapters qualtrim_bp out_bp w/adapters2 qualtrim2_bp out2_bp
OK 1000000 202000000 24827 0 0 975173 28968 1674222 97441426 0 0 98492473
This is the meaning of each column:
Column heading | Explanation |
---|---|
status | Incomplete adapter warning (OK or WARN ) |
in_reads | Number of processed reads (read pairs for paired-end) |
in_bp | Number of processed basepairs |
too_short | Number of reads/read pairs that were too short |
too_long | Number of reads/read pairs that were too long |
too_many_n | Number of reads/read pairs that contained too many N |
out_reads | Number of reads written |
w/adapters | Number of reads containing at least one adapter |
qualtrim_bp | Number of bases removed from R1 reads by quality trimming |
out_bp | Number of bases written to R1 reads |
w/adapters2 | Number of R2 reads containing at least one adapter |
qualtrim2_bp | Number of bases removed from R2 reads by quality trimming |
out2_bp | Number of bases written to R2 reads |
The last three fields are omitted for single-end data.
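Because the minimal report is a plain tab-separated table, it is easy to process in a script. The following Python sketch parses the example shown above with the standard `csv` module; the function name `parse_minimal_report` is hypothetical. In this example, the number of reads written equals the input minus the filtered read pairs.

```python
import csv
import io

HEADER = ("status in_reads in_bp too_short too_long too_many_n out_reads "
          "w/adapters qualtrim_bp out_bp w/adapters2 qualtrim2_bp out2_bp")
VALUES = ("OK 1000000 202000000 24827 0 0 975173 "
          "28968 1674222 97441426 0 0 98492473")
# Rebuild the tab-separated text as Cutadapt would print it.
REPORT = "\t".join(HEADER.split()) + "\n" + "\t".join(VALUES.split()) + "\n"


def parse_minimal_report(text):
    """Return the single content row as a dict; all columns except status are integers."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    row = next(reader)
    return {key: (value if key == "status" else int(value)) for key, value in row.items()}


report = parse_minimal_report(REPORT)
filtered = report["too_short"] + report["too_long"] + report["too_many_n"]
print(report["status"], report["in_reads"] - filtered == report["out_reads"])
```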
How to read the report¶
After every run, Cutadapt prints out per-adapter statistics. The output starts with something like this:
Sequence: 'ACGTACGTACGTTAGCTAGC'; Length: 20; Trimmed: 2402 times.
If option --revcomp
was used,
this line will additionally contain something like Reverse-complemented:
984 times
. This describes how many times of the 2402 total times the
adapter was found on the reverse complement of the read.
The next piece of information is this:
No. of allowed errors:
0-7 bp: 0; 8-15 bp: 1; 16-20 bp: 2
The adapter, as was shown above, has a length of 20 characters. We are using a custom error rate of 0.125. What this implies is shown above: Matches up to a length of 7 bp are allowed to have no errors. Matches of lengths 8-15 bp are allowed to have 1 error and matches of length 16 or more can have 2 errors. See also the section about error-tolerant matching.
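The ranges in the "No. of allowed errors" line follow from rounding down the product of error rate and match length. The Python sketch below (hypothetical function names) reproduces the line shown above for an adapter of length 20 and an error rate of 0.125.

```python
def allowed_error_ranges(adapter_length, error_rate):
    """Return (first_length, last_length, allowed_errors) tuples."""
    ranges = []
    start = 0
    current = 0
    for length in range(1, adapter_length + 1):
        errors = int(error_rate * length)  # rounded down
        if errors != current:
            ranges.append((start, length - 1, current))
            start, current = length, errors
    ranges.append((start, adapter_length, current))
    return ranges


def format_ranges(ranges):
    return "; ".join(f"{first}-{last} bp: {errors}" for first, last, errors in ranges)


print(format_ranges(allowed_error_ranges(20, 0.125)))
```

A slightly lower rate shifts the boundaries: with 0.12, a match of length 8 would still not be allowed any error, because 0.96 is rounded down to 0.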
Finally, a table is output that gives more detailed information about the lengths of the removed sequences. The following is only an excerpt; some rows are left out:
Overview of removed sequences
length count expect max.err error counts
3 140 156.2 0 140
4 57 39.1 0 57
5 50 9.8 0 50
6 35 2.4 0 35
7 13 0.3 0 1 12
8 31 0.1 1 0 31
...
100 397 0.0 3 358 36 3
The first row tells us the following: Three bases were removed in 140 reads; randomly, one would expect this to occur 156.2 times; the maximum number of errors at that match length is 0 (this is actually redundant since we know already that no errors are allowed at lengths 0-7 bp).
The last column shows the number of reads that had 0, 1, 2 … errors. In the last row, for example, 358 reads matched the adapter with zero errors, 36 with 1 error, and 3 matched with 2 errors.
The row for length 7 shows an apparent anomaly: the max.err column is 0 and yet 12 reads match with 1 error. This is because these matches are actually alignments to the first 8 bases of the adapter with one deletion, so 7 bases are removed but the error cut-off applied is the one for length 8.
The “expect” column gives only a rough estimate of the number of
sequences that is expected to match randomly, but it can help to
estimate whether the matches that were found are true adapter matches
or if they are due to chance. At length 6, for example, only 2.4
reads are expected, but 35 do match, which hints that most of these
matches are due to actual adapters.
For slightly more accurate estimates, you can provide the correct
GC content (as a percentage) of your reads with the option
--gc-content
. The default is --gc-content=50
.
Note that the “length” column refers to the length of the removed sequence. That is, the actual length of the match in the above row at length 100 is 20 since that is the adapter length. Assuming the read length is 100, the adapter was found in the beginning of 397 reads and therefore those reads were trimmed to a length of zero.
The table may also be useful in case the given adapter sequence contains an error. In that case, it may look like this:
...
length count expect max.err error counts
10 53 0.0 1 51 2
11 45 0.0 1 42 3
12 51 0.0 1 48 3
13 39 0.0 1 0 39
14 40 0.0 1 0 40
15 36 0.0 1 0 36
...
We can see that no matches longer than 12 have zero errors. In this case, it indicates that the 13th base of the given adapter sequence is incorrect.
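This kind of inspection can be automated. The Python sketch below scans rows of (length, error counts) and reports the length from which on no error-free match occurs. It assumes a 3' adapter where the removed length equals the matched adapter prefix, as in the excerpt above; the function name `suspect_position` is hypothetical.

```python
def suspect_position(rows):
    """Return the first length from which on all rows have zero error-free matches.

    rows is a list of (length, error_counts) tuples, where error_counts[0]
    is the number of matches with zero errors. Returns None if the longest
    row still has error-free matches.
    """
    suspect = None
    for length, counts in sorted(rows):
        if counts[0] == 0:
            if suspect is None:
                suspect = length
        else:
            suspect = None  # error-free matches exist at this length
    return suspect


rows = [
    (10, [51, 2]), (11, [42, 3]), (12, [48, 3]),
    (13, [0, 39]), (14, [0, 40]), (15, [0, 36]),
]
print(suspect_position(rows))
```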
JSON report¶
With --json=filename.cutadapt.json
, a report in JSON format is written to the given file.
We strongly recommend that you use the .cutadapt.json
file name extension for this file for
easier discoverability by log-parsing tools such as MultiQC.
Info file¶
When the --info-file=info.tsv
command-line parameter is given, detailed
information about where adapters were found in each read are written
to the given text file as tab-separated values.
See the description of the info file format.
Reference guide¶
JSON report format¶
The JSON report is generated if --json=filename.cutadapt.json
is used. The file name
extension must be .cutadapt.json
for the file to be recognized by log-parsing tools such
as MultiQC. (However, at the time of writing, MultiQC does not support
Cutadapt’s JSON report format.)
See how to extract information from the JSON report with jq.
Example¶
This example was reformatted to use less vertical space:
{
"tag": "Cutadapt report",
"schema_version": [0, 1],
"cutadapt_version": "3.5",
"python_version": "3.8.10",
"command_line_arguments": [
"--json=out.cutadapt.json", "-m", "20", "-a", "AACCGGTTACGTTGCA",
"-q", "20", "--discard-trimmed", "-o", "out.fastq.gz", "reads.fastq"],
"cores": 1,
"input": {
"path1": "reads.fastq",
"path2": null,
"paired": false,
"interleaved": null
},
"read_counts": {
"input": 100000,
"filtered": {
"too_short": 251,
"too_long": null,
"too_many_n": null,
"too_many_expected_errors": null,
"casava_filtered": null,
"discard_trimmed": 2061,
"discard_untrimmed": null
},
"output": 97688,
"reverse_complemented": null,
"read1_with_adapter": 2254,
"read2_with_adapter": null
},
"basepair_counts": {
"input": 10100000,
"input_read1": 10100000,
"input_read2": null,
"quality_trimmed": 842048,
"quality_trimmed_read1": 842048,
"quality_trimmed_read2": null,
"output": 9038081,
"output_read1": 9038081,
"output_read2": null
},
"adapters_read1": [
{
"name": "1",
"total_matches": 2254,
"on_reverse_complement": null,
"linked": false,
"five_prime_end": null,
"three_prime_end": {
"type": "regular_three_prime",
"sequence": "AACCGGTTACGTTGCA",
"error_rate": 0.1,
"indels": true,
"error_lengths": [6],
"matches": 2254,
"adjacent_bases": {
"A": 473,
"C": 1240,
"G": 328,
"T": 207,
"": 6
},
"dominant_adjacent_base": null,
"trimmed_lengths": [
{"len": 3, "expect": 1562.5, "counts": [1220]},
{"len": 4, "expect": 390.6, "counts": [319]},
{"len": 5, "expect": 97.7, "counts": [30]},
{"len": 6, "expect": 24.4, "counts": [4]},
{"len": 7, "expect": 24.4, "counts": [5]},
{"len": 8, "expect": 24.4, "counts": [7]},
{"len": 9, "expect": 24.4, "counts": [4]},
{"len": 10, "expect": 24.4, "counts": [7]},
{"len": 11, "expect": 24.4, "counts": [7]},
{"len": 12, "expect": 24.4, "counts": [6]},
{"len": 13, "expect": 24.4, "counts": [8, 2]},
{"len": 14, "expect": 24.4, "counts": [1, 1]},
{"len": 15, "expect": 24.4, "counts": [2, 0]},
{"len": 16, "expect": 24.4, "counts": [3, 1]}
]
}
}
],
"adapters_read2": null
}
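A report like this can be read with Python's standard `json` module. The sketch below uses an abbreviated copy of the read_counts part of the example (embedded as a string so that the snippet is self-contained) and checks that the output count plus all filtered reads equals the input count. The function name `filtered_total` is hypothetical. For a real report, `json.load()` on the opened file would replace `json.loads()`.

```python
import json

REPORT_TEXT = """
{
  "tag": "Cutadapt report",
  "read_counts": {
    "input": 100000,
    "filtered": {
      "too_short": 251, "too_long": null, "too_many_n": null,
      "too_many_expected_errors": null, "casava_filtered": null,
      "discard_trimmed": 2061, "discard_untrimmed": null
    },
    "output": 97688
  }
}
"""


def filtered_total(report):
    """Sum the counts of all filters that were active (null means 'not used')."""
    filtered = report["read_counts"]["filtered"]
    return sum(count for count in filtered.values() if count is not None)


report = json.loads(REPORT_TEXT)
counts = report["read_counts"]
print(filtered_total(report), counts["output"] + filtered_total(report) == counts["input"])
```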
Schema¶
Some concepts used in the JSON file:
- Keys are always included. If a key is not applicable, its value is set to null.
- Single-end data appears as “paired-end data without read 2”. That is, values for read 1 are filled in and values for read 2 are set to null.
The file defines the following keys. For nested objects (dictionaries), a dot notation is used, as in “outer_key.inner_key”.
- tag : string
- Always "Cutadapt report". A marker so that this can be recognized as a file produced by Cutadapt.
- schema_version : list of two integers
Major and minor version of the schema. If additions are made to the schema, the minor version is increased. If backwards incompatible changes are made, the major version is increased.
Example:
[0, 1]
- cutadapt_version : str
The version of Cutadapt that generated the report.
Example:
"3.5"
- python_version : str
The Python version used to run Cutadapt.
Example:
"3.9"
- command_line_arguments : list of strings
The command-line arguments for this invocation. Only for information, do not parse this.
Example:
["-a", "ACGT", "-o", "out.fastq", "input.fastq"]
- cores : int
- Number of cores used
- input : dictionary
- Input files
- input.path1 : str
Path to the first input file.
Example:
"reads.1.fastq"
- input.path2 : str | null
- Path to the second input file if given, null otherwise.
- input.paired : bool
- True if input was paired-end reads, false if input was single-end reads. If this is true and input.path2 is null, input was interleaved.
- read_counts : dictionary
- Read count statistics
- read_counts.input : int
- Number of reads (for single-end data) or read pairs (for paired-end data) in the input.
- read_counts.filtered : dictionary
- Statistics about filtered reads. Keys of the dictionary correspond to a filter. If a filter was not used, its value is set to null.
- read_counts.filtered.too_short : int | null
- Number of reads or read pairs that were filtered because they were too short
- read_counts.filtered.too_long : int | null
- Number of reads or read pairs that were filtered because they were too long
- read_counts.filtered.too_many_n : int | null
- Number of reads or read pairs that were filtered because they had too many N bases
- read_counts.filtered.too_many_expected_errors : int | null
- Number of reads or read pairs that were filtered because they had too many expected errors
- read_counts.filtered.casava_filtered : int | null
- Number of reads or read pairs that were filtered because the CASAVA filter was Y
- read_counts.filtered.discard_trimmed : int | null
- Number of reads or read pairs that were filtered because at least one adapter match was found for them
- read_counts.filtered.discard_untrimmed : int | null
- Number of reads or read pairs that were filtered because no adapter match was found for them
- read_counts.output : int
- Number of reads written to the final output. This plus the sum of all filtered reads/read pairs will equal the number of input reads.
- read_counts.reverse_complemented : int | null
- If --revcomp was used, the number of reads that were output reverse-complemented, null otherwise.
- read_counts.read1_with_adapter : int | null
- Number of R1 reads (or single-end reads) with at least one adapter match, null if no adapter trimming was done.
- read_counts.read2_with_adapter : int | null
- Number of R2 reads with at least one adapter match, null if input is single end or no adapter trimming was done.
- basepair_counts : dictionary
- Statistics about the number of basepairs.
- basepair_counts.input : int
- Total number of basepairs in the input. (The sum of the lengths of all input reads.)
- basepair_counts.input_read1 : int
- Number of basepairs in the input, read 1 only.
- basepair_counts.input_read2 : int | null
- If paired-end, number of basepairs in the input counting read 2 only, null otherwise.
- basepair_counts.quality_trimmed : int | null
- Total number of basepairs removed due to quality trimming, null if no quality trimming was done.
- basepair_counts.quality_trimmed_read1 : int | null
- Number of basepairs removed from read 1 due to quality trimming, null if no quality trimming was done.
- basepair_counts.quality_trimmed_read2 : int | null
- Number of basepairs removed from read 2 due to quality trimming, null if no quality trimming was done or if input was single end.
- basepair_counts.output : int
- Total number of basepairs in the final output.
- basepair_counts.output_read1 : int
- Number of basepairs written to the read 1 final output.
- basepair_counts.output_read2 : int | null
- Number of basepairs written to the read 2 final output, null if the input was single end.
- adapters_read1 : list of dictionaries
- A list with statistics about all adapters that were matched against read 1. The list is empty if no adapter trimming was done. The schema for the items in this list is described below.
- adapters_read2 : list of dictionaries | null
- A list with statistics about all adapters that were matched against read 2. The list is empty if no adapter trimming was done against R2. The value is set to null if the input was single end reads. The schema for the items in this list is described below.
Adapter statistics¶
The statistics about each adapter (items in the adapters_read1 and adapters_read2 list) are dictionaries with the following keys.
- name : str
- The adapter name. If no adapter name was given, a name is automatically generated as “1”, “2”, “3” etc.
- total_matches : int
- Number of times this adapter was found on a read. If --times is used, multiple matches per read are possible.
- on_reverse_complement : int | null
- If --revcomp was used, the number of times the adapter was found on the reverse-complemented read, null otherwise.
- linked : bool
- Whether this is a linked adapter. If true, then both five_prime_end and three_prime_end (below) are filled in and describe the 5’ and 3’ components, respectively, of the linked adapter.
- five_prime_end : dictionary | null
Statistics about matches of this adapter to the 5’ end, that is, causing a prefix of the read to be removed.
If the adapter is of type regular_five_prime, noninternal_five_prime or anchored_five_prime, all its matches are summarized here.
If the adapter is a linked adapter (linked is true), the matches of its 5’ component are summarized here.
If the adapter is of type “anywhere”, the matches that were determined to be 5’ matches are summarized here.
This is null for the other adapter types.
- three_prime_end : dictionary | null
Statistics about matches of this adapter to the 3’ end, that is, causing a suffix of the read to be removed.
If the adapter is of type regular_three_prime, noninternal_three_prime or anchored_three_prime, all its matches are summarized here.
If the adapter is a linked adapter (linked is true), the matches of its 3’ component are summarized here.
If the adapter is of type “anywhere”, the matches that were determined to be 3’ matches are summarized here.
This is null for the other adapter types.
- three/five_prime_end.type : str
- Type of the adapter. One of these strings:
"regular_five_prime"
"regular_three_prime"
"noninternal_five_prime"
"noninternal_three_prime"
"anchored_five_prime"
"anchored_three_prime"
"anywhere"
For linked adapters, this is the type of its 5’ or 3’ component.
- three/five_prime_end.sequence : str
Sequence of this adapter. For linked adapters, this is the sequence of its 5’ or 3’ component.
Example:
"AACCGGTT"
- three/five_prime_end.error_rate : float
- Error rate for this adapter. For linked adapters, the error rate for the respective end.
- three/five_prime_end.indels : bool
- Whether indels are allowed when matching this adapter against the read.
- three/five_prime_end.error_lengths : list of ints | null
If the adapter type allows partial matches, this lists the lengths up to which 0, 1, 2 etc. errors are allowed. Example: [9, 16] means: 0 errors allowed up to a match of length 9, 1 error up to a match of length 16. The last number in this list is the length of the adapter sequence.
For anchored adapter types, this is null.
- three/five_prime_end.matches : int
- The number of matches of this adapter against the 5’ or 3’ end.
- three/five_prime_end.adjacent_bases : dictionary | null
For 3’ adapter types, this shows which bases occurred adjacent to (upstream of) the 3’ adapter match. It is a dictionary mapping the strings “A”, “C”, “G”, “T” and “” (empty string) to the number of occurrences. The empty string covers those cases in which the adjacent base was not one of A, C, G or T or in which there was no adjacent base (3’ adapter found at the beginning of the read).
This is null for 5’ adapters (adjacent base statistics are currently not tracked for those).
- three/five_prime_end.dominant_adjacent_base : str | null
This is set to the dominant adjacent base if adjacent_bases exist and were determined to be sufficiently skewed, corresponding to the warning: “The adapter is preceded by “x” extremely often.”
This is null otherwise.
- three/five_prime_end.trimmed_lengths : list of dictionaries
The histogram of the lengths of removed sequences. Each item in the list is a dictionary that describes how often a sequence of a certain length was removed, broken down by the number of errors in the adapter match.
Example:
"trimmed_lengths": [
  {"len": 4, "expect": 390.6, "counts": [319]},
  {"len": 5, "expect": 97.7, "counts": [30]},
  {"len": 6, "expect": 24.4, "counts": [4]},
  {"len": 7, "expect": 24.4, "counts": [5]},
  {"len": 15, "expect": 24.4, "counts": [2, 1]}
]
- three/five_prime_end.trimmed_lengths.expect : float
- How often a sequence of length len would be expected to be removed due to random chance.
- three/five_prime_end.trimmed_lengths.counts : list of int
Element at index i in this list gives how often a sequence of length len was removed due to an adapter match with i errors. Sum these values to get the total count.
Example (5 sequences had 0 errors in the adapter matches, 3 had 1 and 1 had 2):
[5, 3, 1]
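As a rough illustration of how the trimmed_lengths histogram can be summarized, here is a minimal Python sketch. It uses the example data shown above; the function names total_per_length and totals_by_error_count are hypothetical helpers and not part of Cutadapt.

```python
# Example histogram copied from the trimmed_lengths example above.
trimmed_lengths = [
    {"len": 4, "expect": 390.6, "counts": [319]},
    {"len": 5, "expect": 97.7, "counts": [30]},
    {"len": 6, "expect": 24.4, "counts": [4]},
    {"len": 7, "expect": 24.4, "counts": [5]},
    {"len": 15, "expect": 24.4, "counts": [2, 1]},
]


def total_per_length(histogram):
    """Map each removed length to the total number of removals (hypothetical helper)."""
    return {item["len"]: sum(item["counts"]) for item in histogram}


def totals_by_error_count(histogram):
    """Return a list in which index i is the number of matches with i errors."""
    totals = []
    for item in histogram:
        for errors, count in enumerate(item["counts"]):
            # Grow the list on demand; longer matches allow more errors.
            while len(totals) <= errors:
                totals.append(0)
            totals[errors] += count
    return totals


per_length = total_per_length(trimmed_lengths)
by_errors = totals_by_error_count(trimmed_lengths)
print(per_length)
print(by_errors)
```

Comparing the per-length totals with the expect values is one way to judge whether short matches are likely to be random.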
Info file format¶
When the --info-file command-line parameter is given, detailed information about where adapters were found in each read is written to the given file. It is a tab-separated text file that contains at least one row per input read. Normally, there is exactly one row per input read, but in the following cases, multiple rows may be output:
- The option --times is in use.
- A linked adapter is used.
A row is written for all input reads, even those that are discarded from the final FASTA/FASTQ output due to filtering options.
Which fields are output in each row depends on whether an adapter match was found in the read or not.
If an adapter match was found, these fields are output in a row:
- Read name
- Number of errors
- 0-based start coordinate of the adapter match
- 0-based end coordinate of the adapter match
- Sequence of the read to the left of the adapter match (can be empty)
- Sequence of the read that was matched to the adapter
- Sequence of the read to the right of the adapter match (can be empty)
- Name of the found adapter.
- Quality values corresponding to sequence left of the adapter match (can be empty)
- Quality values corresponding to sequence matched to the adapter (can be empty)
- Quality values corresponding to sequence to the right of the adapter match (can be empty)
- Flag indicating whether the read was reverse complemented: 1 if yes, 0 if not, and empty if --revcomp was not used.
The concatenation of fields 5-7 yields the full read sequence. Column 8 identifies the found adapter. The section about named adapters describes how to give a name to an adapter. Adapters without a name are numbered starting from 1. Fields 9-11 are empty if quality values are not available. Concatenating them yields the full sequence of quality values.
If the adapter match was found on the reverse complement of the read, fields 5 to 7 show the reverse-complemented sequence, and fields 9-11 contain the qualities in reversed order.
If no adapter was found, the format is as follows:
- Read name
- The value -1 (use this to distinguish between match and non-match)
- The read sequence
- Quality values
When parsing the file, be aware that additional columns may be added in the future. Also, some fields can be empty, resulting in consecutive tabs within a line.
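The two row layouts described above can be told apart by the second column. The following Python sketch shows one way to parse a row; parse_info_line and the example rows are hypothetical and only follow the column description in this section, so treat it as a starting point rather than a reference parser.

```python
def parse_info_line(line):
    """Parse one row of an info file into a dictionary (illustrative sketch)."""
    # Strip only the line ending: trailing empty fields must be preserved.
    fields = line.rstrip("\r\n").split("\t")
    record = {"read_name": fields[0], "errors": int(fields[1])}
    if record["errors"] == -1:
        # No adapter match: read name, -1, read sequence, quality values.
        record["matched"] = False
        record["sequence"] = fields[2]
        record["qualities"] = fields[3] if len(fields) > 3 else ""
        return record
    # Pad so that columns missing in older versions read as empty strings.
    fields += [""] * (12 - len(fields))
    record.update(
        matched=True,
        start=int(fields[2]),
        end=int(fields[3]),
        left=fields[4],
        match=fields[5],
        right=fields[6],
        adapter=fields[7],
        qualities=fields[8] + fields[9] + fields[10],
        # Empty flag (no --revcomp) maps to None.
        revcomp={"1": True, "0": False}.get(fields[11]),
    )
    return record


# Made-up example rows, not real Cutadapt output.
matched_row = "read1\t0\t4\t8\tAAAA\tCCGG\tTT\tmy_adapter\tIIII\tHHHH\tGG\t"
unmatched_row = "read2\t-1\tACGT\tIIII"
rec = parse_info_line(matched_row)
unmatched = parse_info_line(unmatched_row)
print(rec["left"] + rec["match"] + rec["right"])
```

Extra columns beyond the twelfth are ignored here, which keeps the sketch tolerant of columns added in future versions.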
If the --times option is used and greater than 1, each read can appear more than once in the info file. There will be one line for each found adapter, all with identical read names. Only for the first of those lines will the concatenation of columns 5-7 be identical to the original read sequence (and accordingly for columns 9-11). For subsequent lines, the sequences shown are the ones that were used in subsequent rounds of adapter trimming, that is, they get successively shorter.
Linked adapters appear with up to two rows for each read, one for each constituent adapter for which a match has been found. To be able to see which of the two adapters a row describes, the adapter name in column 8 is modified: If the row describes a match of the 5’ adapter, the string ;1 is added. If it describes a match of the 3’ adapter, the string ;2 is added. If there are two rows, the 5’ match always comes first.
New in version 1.9: Columns 9-11 were added.
New in version 2.8: Linked adapters in info files work.
New in version 3.4: Column 12 (revcomp flag) added
Properly paired reads¶
When reading paired-end files, Cutadapt checks whether the read names match. Only the part of the read name before the first space is considered. If the read name ends with 1 or 2 or 3, then that is also ignored. For example, two FASTQ headers that would be considered to denote properly paired reads are:
@my_read/1 a comment
and:
@my_read/2 another comment
This is an example of improperly paired read names:
@my_read/1;1
and:
@my_read/2;1
Since the 1 and 2 (and 3) are ignored only if they occur at the end of the read name, and since the ;1 is considered to be part of the read name, these reads will not be considered to be properly paired.
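The rule described in this section can be sketched in a few lines of Python. This only illustrates the rule as stated here and is not Cutadapt's actual implementation; pairing_key and names_match are hypothetical names.

```python
def pairing_key(header):
    """Reduce a FASTQ header to the part that is compared between R1 and R2."""
    # Drop the leading "@" of a FASTQ header line, if present.
    name = header[1:] if header.startswith("@") else header
    # Only the part before the first space is considered.
    name = name.split(" ", 1)[0]
    # A trailing 1, 2 or 3 is ignored.
    if name[-1:] in ("1", "2", "3"):
        name = name[:-1]
    return name


def names_match(header1, header2):
    """Return True if the two headers denote a properly paired read."""
    return pairing_key(header1) == pairing_key(header2)


print(names_match("@my_read/1 a comment", "@my_read/2 another comment"))
print(names_match("@my_read/1;1", "@my_read/2;1"))
```

In the second call, only the final 1 of ;1 is dropped, so the remaining names still differ in the 1 and 2 before the semicolon.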
Recipes¶
This section contains short how-to guides for doing certain tasks.
Remove more than one adapter¶
If you want to remove a 5’ and 3’ adapter at the same time, use the support for linked adapters.
If your situation is different, for example, when you have many 5’ adapters but only one 3’ adapter, then you have two options.
First, you can specify the adapters and also --times=2 (or the short version -n 2). For example:
cutadapt -g ^TTAAGGCC -g ^AAGCTTA -a TACGGACT -n 2 -o output.fastq input.fastq
This instructs Cutadapt to run two rounds of adapter finding and removal. That means that, after the first round and only when an adapter was actually found, another round is performed. In both rounds, all given adapters are searched and removed. The problem is that it could happen that one adapter is found twice (so the 3’ adapter, for example, could be removed twice).
The second option is to not use the -n option, but to run Cutadapt twice, first removing one adapter and then the other. It is easiest if you use a pipe as in this example:
cutadapt -g ^TTAAGGCC -g ^AAGCTTA input.fastq | cutadapt -a TACGGACT - > output.fastq
Trim poly-A tails¶
If you want to trim a poly-A tail from the 3’ end of your reads, use the 3’ adapter type (-a) with an adapter sequence of many repeated A nucleotides. Use the following notation to specify a sequence that consists of 100 A:
cutadapt -a "A{100}" -o output.fastq input.fastq
This also works when there are sequencing errors in the poly-A tail. So this read
TACGTACGTACGTACGAAATAAAAAAAAAAA
will be trimmed to:
TACGTACGTACGTACG
If for some reason you would like to use a shorter sequence of A, you can do so: The matching algorithm always picks the leftmost match that it can find, so Cutadapt will do the right thing even when the tail has more A than you used in the adapter sequence. However, sequencing errors may result in shorter matches than desired. For example, using -a "A{10}", the read above (where the AAAT is followed by eleven A) would be trimmed to:
TACGTACGTACGTACGAAAT
Depending on your application, perhaps a variant of -a A{10}N{90} is an alternative, forcing the match to be located as much to the left as possible, while still allowing for non-A bases towards the end of the read.
Trim a fixed number of bases preceding each adapter¶
If the adapters you want to remove are preceded by some unknown sequence (such as a random tag/molecular identifier), you can specify this as part of the adapter sequence in order to remove both in one go.
For example, assume you want to trim Illumina adapters preceded by 10 bases that you want to trim as well. Instead of this command:
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC ...
Use this command:
cutadapt -O 13 -a N{10}AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC ...
The -O 13 is the minimum overlap for an adapter match, where the 13 is computed as 3 plus 10 (where 3 is the default minimum overlap and 10 is the length of the unknown section). If you do not specify it, the adapter sequence would match the end of every read (because N matches anything), and ten bases would then be removed from every read.
Trimming (amplicon-) primers from paired-end reads¶
If the reads are shorter than the amplicon, use
cutadapt -g ^FWDPRIMER -G ^REVPRIMER --discard-untrimmed -o out.1.fastq.gz -p out.2.fastq.gz in.1.fastq.gz in.2.fastq.gz
If the reads can be longer than the amplicon, use a “linked adapter”:
cutadapt -a ^FWDPRIMER...RCREVPRIMER -A ^REVPRIMER...RCFWDPRIMER --discard-untrimmed -o out.1.fastq.gz -p out.2.fastq.gz in.1.fastq.gz in.2.fastq.gz
You need to insert your own sequences as follows. The three dots ... need to be written as they are.
- FWDPRIMER
- Sequence of the forward primer
- REVPRIMER
- Sequence of the reverse primer
- RCFWDPRIMER
- Reverse-complemented sequence of the forward primer
- RCREVPRIMER
- Reverse-complemented sequence of the reverse primer
Explanation¶
The full DNA fragment that is put on the sequencer looks like this (looking only at the forward strand):
5’ sequencing primer – forward primer – sequence of interest – reverse complement of reverse primer – reverse complement of 3’ sequencing primer
Since sequencing of R1 starts after the 5’ sequencing primer, R1 will start with the forward primer and then continue into the sequence of interest and possibly into the two primers to the right of it, depending on the read length and how long the sequence of interest is.
If the reads are sufficiently short, R1 will not extend into the reverse primer, and R2 will not extend into the forward primer. In that case, only the forward primer on R1 and the reverse primer on R2 need to be removed:
-g ^FWDPRIMER -G ^REVPRIMER --discard-untrimmed
If the reads are so long that they can possibly extend into the primer on the respective other side, linked adapters for both R1 and R2 can be used. For R1:
-a ^FWDPRIMER...RCREVPRIMER
Sequencing of R2 starts before the 3’ sequencing primer and proceeds along the reverse-complementary strand. For the correct linked adapter, the sequences from above therefore need to be swapped and reverse-complemented:
-A ^REVPRIMER...RCFWDPRIMER
The uppercase -A specifies that this option is meant to work on R2. Cutadapt does not reverse-complement any sequences on its own; you will have to do that yourself.
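Since the reverse complements have to be computed by hand, a small helper can avoid mistakes. The following Python sketch is one possible way to do it, including IUPAC wildcard characters; reverse_complement and the two primer sequences are made up for illustration.

```python
# Complement table for nucleotides and IUPAC wildcard codes (both cases).
_COMPLEMENT = str.maketrans(
    "ACGTUMRWSYKVHDBNacgtumrwsykvhdbn",
    "TGCAAKYWSRMBDHVNtgcaakywsrmbdhvn",
)


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


# Hypothetical primer sequences; substitute your own.
fwd_primer = "AAGGCT"
rev_primer = "GTTACC"

# Linked adapter arguments as described above (-a for R1, -A for R2).
r1_adapter = f"^{fwd_primer}...{reverse_complement(rev_primer)}"
r2_adapter = f"^{rev_primer}...{reverse_complement(fwd_primer)}"
print("-a", r1_adapter, "-A", r2_adapter)
```

The printed strings can be pasted into the cutadapt command line in place of the FWDPRIMER, RCREVPRIMER, REVPRIMER and RCFWDPRIMER placeholders.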
Finally, you may want to filter the trimmed read pairs.
Option --discard-untrimmed throws away all read pairs in which R1 doesn’t start with FWDPRIMER or in which R2 does not start with REVPRIMER.
A note on how the filtering works: In linked adapters, by default the first part (before the ...) is anchored. Anchored sequences must occur. If they don’t, then the other sequence (after the ...) is not even searched for and the entire read is internally marked as “untrimmed”. This is done for both R1 and R2 and as soon as any of them is marked as “untrimmed”, the entire pair is considered to be “untrimmed”. If --discard-untrimmed is used, this means that the entire pair is discarded if R1 or R2 are untrimmed. (Option --pair-filter=both can be used to change this to require that both were marked as untrimmed.)
Piping paired-end data¶
Sometimes it is necessary to run Cutadapt twice on your data, for example, when you want to change the order in which read modification or filtering options are applied. To simplify this, you can use Unix pipes (|), but this is more difficult with paired-end data since then input and output consist of two files each.
The solution is to interleave the paired-end data, send it over the pipe and then de-interleave it in the other process. Here is how this looks in principle:
cutadapt [options] --interleaved in.1.fastq.gz in.2.fastq.gz | \
cutadapt [options] --interleaved -o out.1.fastq.gz -p out.2.fastq.gz -
Note the - character in the second invocation to Cutadapt.
Support for concatenated compressed files¶
Cutadapt supports concatenated gzip and bzip2 input files.
Check whether a FASTQ file is properly formatted¶
cutadapt -o /dev/null input.fastq
Any problems with the FASTQ file will be detected and reported.
Check whether FASTQ files are properly paired¶
cutadapt -o /dev/null -p /dev/null input.R1.fastq input.R2.fastq
Any problems with the individual FASTQ files or improperly paired reads (mismatching read ids) will be detected and reported.
Rescuing single reads from paired-end reads that were filtered¶
When trimming and filtering paired-end reads, Cutadapt always discards entire read pairs. If you want to keep one of the reads, you need to write the filtered read pairs to an output file and postprocess it.
For example, assume you are using -m 30 to discard too short reads. Cutadapt discards all read pairs in which just one of the reads is too short (but see the --pair-filter option).
To recover those (individual) reads that are long enough, you can first use the --too-short-(paired)-output options to write the filtered pairs to a file, and then postprocess those files to keep only the long enough reads.
cutadapt -m 30 -q 20 -o out.1.fastq.gz -p out.2.fastq.gz --too-short-output=tooshort.1.fastq.gz --too-short-paired-output=tooshort.2.fastq.gz in.1.fastq.gz in.2.fastq.gz
cutadapt -m 30 -o rescued.a.fastq.gz tooshort.1.fastq.gz
cutadapt -m 30 -o rescued.b.fastq.gz tooshort.2.fastq.gz
The two output files rescued.a.fastq.gz and rescued.b.fastq.gz contain those individual reads that are long enough. Note that the file names do not end in .1.fastq.gz and .2.fastq.gz to make it very clear that these files no longer contain synchronized paired-end reads.
Bisulfite sequencing (RRBS)¶
When trimming reads that come from a library prepared with the RRBS (reduced representation bisulfite sequencing) protocol, the last two 3’ bases must be removed in addition to the adapter itself. This can be achieved by using not the adapter sequence itself, but by adding two wildcard characters to its beginning. If the adapter sequence is ADAPTER, the command for trimming should be:
cutadapt -a NNADAPTER -o output.fastq input.fastq
Details can be found in Babraham bioinformatics’ “Brief guide to RRBS”. A summary follows.
During RRBS library preparation, DNA is digested with the restriction enzyme MspI, generating a two-base overhang on the 5’ end (CG). MspI recognizes the sequence CCGG and cuts between C and CGG. A double-stranded DNA fragment is cut in this way:
5'-NNNC|CGGNNN-3'
3'-NNNGGC|CNNN-5'
The fragment between two MspI restriction sites looks like this:
5'-CGGNNN...NNNC-3'
3'-CNNN...NNNGGC-5'
Before sequencing (or PCR) adapters can be ligated, the missing base positions must be filled in with GTP and CTP:
5'-ADAPTER-CGGNNN...NNNCcg-ADAPTER-3'
3'-ADAPTER-gcCNNN...NNNGGC-ADAPTER-5'
The filled-in bases, marked in lowercase above, do not contain any original methylation information, and must therefore not be used for methylation calling. By prefixing the adapter sequence with NN, the bases will be automatically stripped during adapter trimming.
Convert FASTQ to FASTA¶
Cutadapt detects the output format from the output file name extension. Convert FASTQ to FASTA format:
cutadapt -o output.fasta.gz input.fastq.gz
Cutadapt detects FASTA output and omits the qualities.
If output is written to standard output, no output file name is available, so the same format as the input is used.
To force FASTA output even in this case, use the --fasta option:
cutadapt --fasta input.fastq.gz > out.fasta
Extract information from the JSON report with jq¶
The JSON report that is written when using the --json option can be read by jq.
Get the number of reads (or read pairs) written:
jq '.read_counts.output' mysample.cutadapt.json
Get the percentage of reads that contain an adapter:
jq '.read_counts.read1_with_adapter / .read_counts.input * 100' mysample.cutadapt.json
Get how often the first adapter was found:
jq '.adapters_read1[0].total_matches' mysample.cutadapt.json
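If jq is not available, the same values can be read with Python's json module. The sketch below parses a tiny stand-in report from a string so that it is self-contained; the numbers are made up, and in practice you would load the file written by --json instead.

```python
import json

# Made-up excerpt of a report, restricted to keys used in the jq examples above.
report_text = """
{
  "read_counts": {"input": 1000, "output": 950, "read1_with_adapter": 250},
  "adapters_read1": [{"name": "1", "total_matches": 260}]
}
"""
report = json.loads(report_text)

# Number of reads (or read pairs) written.
reads_written = report["read_counts"]["output"]

# Percentage of reads with an adapter; the value is null (None) if no
# adapter trimming was done, so guard against that.
with_adapter = report["read_counts"]["read1_with_adapter"]
percent_with_adapter = (
    None if with_adapter is None
    else with_adapter / report["read_counts"]["input"] * 100
)

# How often the first adapter was found.
first_adapter_matches = report["adapters_read1"][0]["total_matches"]

print(reads_written, percent_with_adapter, first_adapter_matches)
```

To read a real report, replace json.loads(report_text) with json.load() on the opened report file.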
Quickly test how Cutadapt trims a single sequence¶
Use echo to write the sequence in FASTA format, and run Cutadapt with --quiet:
echo -e ">r\nAACCGGTT" | cutadapt --quiet -a CCGGTTGGAA -
Output:
>r
AA
Algorithm details¶
Adapter alignment algorithm¶
Since the publication of the EMBnet journal application note about Cutadapt, the alignment algorithm used for finding adapters has changed significantly. An overview of this new algorithm is given in this section. An even more detailed description is available in Chapter 2 of my PhD thesis Algorithms and tools for the analysis of high-throughput DNA sequencing data.
The algorithm is based on semiglobal alignment, also called free-shift, ends-free or overlap alignment. In a regular (global) alignment, the two sequences are compared from end to end and all differences occurring over that length are counted. In semiglobal alignment, the sequences are allowed to freely shift relative to each other and differences are only penalized in the overlapping region between them:
   FANTASTIC
ELEFANT
The prefix ELE and the suffix ASTIC do not have a counterpart in the respective other row, but this is not counted as an error. The overlap FANT has a length of four characters.
Traditionally, alignment scores are used to find an optimal overlap alignment: This means that the scoring function assigns a positive value to matches, while mismatches, insertions and deletions get negative values. The optimal alignment is then the one that has the maximal total score. Usage of scores has the disadvantage that they are not at all intuitive: What does a total score of x mean? Is that good or bad? How should a threshold be chosen in order to avoid finding alignments with too many errors?
For Cutadapt, the adapter alignment algorithm primarily uses unit costs instead. This means that mismatches, insertions and deletions are each counted as one error, which is easier to understand and makes it possible to specify a single parameter for the algorithm (the maximum error rate) in order to describe how many errors are acceptable.
There is a problem with this: When using costs instead of scores, we would like to minimize the total costs in order to find an optimal alignment. But then the best alignment would always be the one in which the two sequences do not overlap at all! This would be correct, but meaningless for the purpose of finding an adapter sequence.
The optimization criteria are therefore a bit different. The basic idea is to consider the alignment optimal that maximizes the overlap between the two sequences, as long as the allowed error rate is not exceeded.
Conceptually, the procedure is as follows:
- Consider all possible overlaps between the two sequences and compute an alignment for each, minimizing the total number of errors in each one.
- Keep only those alignments that do not exceed the specified maximum error rate.
- Then, keep only those alignments that have a maximal number of matches (that is, there is no alignment with more matches). (Note: This has been changed, see the section below for an update.)
- If there are multiple alignments with the same number of matches, then keep only those that have the smallest error rate.
- If there are still multiple candidates left, choose the alignment that starts at the leftmost position within the read.
In Step 1, the different adapter types are taken into account: Only those overlaps that are actually allowed by the adapter type are actually considered.
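To make the selection criteria more concrete, here is a deliberately simplified Python sketch of the conceptual procedure. It is not how Cutadapt is implemented: it only counts mismatches (no insertions or deletions), ignores adapter types and the minimum overlap, and tries every shift by brute force instead of using dynamic programming. The function name best_overlap is hypothetical.

```python
def best_overlap(read, adapter, max_error_rate=0.1):
    """Pick an overlap following the conceptual steps (mismatch-only sketch).

    Returns (shift, overlap_length, matches, errors) or None. A negative
    shift means that the adapter starts before the read.
    """
    if not read or not adapter:
        return None
    candidates = []
    # Step 1: consider all possible overlaps (shifts) of the two sequences.
    for shift in range(-(len(adapter) - 1), len(read)):
        read_start = max(shift, 0)
        adapter_start = max(-shift, 0)
        length = min(len(read) - read_start, len(adapter) - adapter_start)
        errors = sum(
            1
            for i in range(length)
            if read[read_start + i] != adapter[adapter_start + i]
        )
        # Step 2: keep only overlaps within the maximum error rate.
        if errors <= max_error_rate * length:
            candidates.append((shift, length, length - errors, errors))
    if not candidates:
        return None
    # Steps 3-5: most matches, then lowest error rate, then leftmost in read.
    return min(
        candidates,
        key=lambda c: (-c[2], c[3] / c[1], max(c[0], 0)),
    )


result = best_overlap("FANTASTIC", "ELEFANT", max_error_rate=0.0)
print(result)
```

For the FANTASTIC/ELEFANT example, the selected overlap is FANT: the adapter starts three characters before the read, and the overlap has four matches and no errors.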
Alignment algorithm changes in Cutadapt 4¶
The above algorithm has been tweaked slightly in Cutadapt 4. The main problem was that the idea of maximizing the number of matches (criterion 3 in the section above) sometimes leads to unintuitive results.
For example, the previous algorithm would prefer an alignment such as this one:
CCAGTCCTTTCCTGAGAGT Read
||||||||   ||
CCAGTCCT---CT 5' adapter
This alignment was considered to be the best one because it contains 10 matches, which is the maximum possible. The three consecutive deletions are ignored when making that decision. To the user, the unexpected result is visible because the read would end up as GAGAGT after trimming.
With the tuned algorithm, the alignment is more sensible:
CCAGTCCTTTCCTGAGAGT Read
||||||||X|
CCAGTCCTCT 5' adapter
The trimmed read is now CCTGAGAGT, which is what one would likely expect.
The alignment algorithm in Cutadapt can perhaps now be described as a hybrid algorithm that uses both edit distance and score:
- Edit distance is used to fill out the dynamic programming matrix. Conceptually, this can be seen as computing the edit distance for all possible overlaps between the read and the adapter. We need to use the edit distance as optimization criterion at this stage because we want to be able to let the user provide a maximum error rate (-e). Also, using edit distance (that is, unit costs) allows using some optimizations while filling in the matrix (Ukkonen’s trick).
- A second matrix with scores is filled in simultaneously. The value in a cell is the score of the edit-distance-based alignment; the score is not used as optimization criterion.
- Finally, the score is used to decide which of the overlaps between read and adapter is the best one. (This means looking into the last row and column of the score matrix.)
The score function is currently: match: +1, mismatch: -1, indel: -2
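Applying this score function to the two example alignments above shows why the second one is now preferred. The following Python snippet is only a worked calculation; alignment_score is a hypothetical helper, and the match, mismatch and indel counts are read off the alignments shown earlier.

```python
# Score function as stated above.
MATCH, MISMATCH, INDEL = 1, -1, -2


def alignment_score(matches, mismatches, indels):
    """Total score of an alignment with the given numbers of events."""
    return matches * MATCH + mismatches * MISMATCH + indels * INDEL


# Previous choice: 10 matches and three deleted bases.
gapped = alignment_score(matches=10, mismatches=0, indels=3)
# New choice: 9 matches and one mismatch.
ungapped = alignment_score(matches=9, mismatches=1, indels=0)

print(gapped, ungapped)
```

The gapped alignment scores 10 - 6 = 4, while the ungapped one scores 9 - 1 = 8, so the alignment with fewer matches but no indels wins.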
A second change in the alignment algorithm is relevant if there are multiple adapter occurrences in a read (such as adapter dimers). With the new algorithm, leftmost (earlier) adapter occurrences are now more reliably preferred even if a later match has fewer errors.
Here are two examples from the SRR452441 dataset (R1 only), trimmed with the standard Illumina adapter. The top row shows the alignment as found by the previous algorithm, the middle row shows the sequencing read, and the last row shows the alignment as found by the updated algorithm.
@SRR452441.2151945
                                         AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC Previous alignment
                                         ||||||||||||||||||||||||||||||||||
-GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCACACGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCACACGAATCTCGTATGCCGTCTTCT
X|||||||||||||||||||||||||||||||||
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC New alignment
Previously the read was trimmed to the first 40 bases, now the earlier, nearly full-length occurrence is taken into account, and the read is empty after trimming.
@SRR452441.2157038
                                AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC Previous alignment
                                ||||||||||||||||||||||||||||||||||
-GATCGGAAGAGCACACGTCTGAACTCCAGTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGCACACGAATCTCGTATGCCGTCTTCTGCTTGAAAA
X||||||||||||||||||||||||||||||||X
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC New alignment
Only very few reads should be affected by the above changes (in SRR452441, which has 2.2 million reads, only four reads were trimmed differently). In those cases where it matters, however, there should now be fewer surprises.
Quality trimming algorithm¶
The trimming algorithm implemented in Cutadapt is the same as the one used by BWA, but applied to both ends of the read in turn (if requested). That is: Subtract the given cutoff from all qualities; compute partial sums from all indices to the end of the sequence; cut the sequence at the index at which the sum is minimal. If both ends are to be trimmed, repeat this for the other end.
The basic idea is to remove all bases starting from the end of the read whose quality is smaller than the given threshold. This is refined a bit by allowing some good-quality bases among the bad-quality ones. In the following example, we assume that the 3’ end is to be quality-trimmed.
Assume you use a threshold of 10 and have these quality values:
42, 40, 26, 27, 8, 7, 11, 4, 2, 3
Subtracting the threshold gives:
32, 30, 16, 17, -2, -3, 1, -6, -8, -7
Then sum up the numbers, starting from the end (partial sums). Stop early if the sum is greater than zero:
(70), (38), 8, -8, -25, -23, -20, -21, -15, -7
The numbers in parentheses are not computed (because 8 is greater than zero), but shown here for completeness. The position of the minimum (-25) is used as the trimming position. Therefore, the read is trimmed to the first four bases, which have quality values 42, 40, 26, 27.
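The worked example can be reproduced with a short Python function. This is a sketch of the procedure as described in this section for the 3’ end only, not a copy of Cutadapt's code; quality_trim_index is a hypothetical name.

```python
def quality_trim_index(qualities, cutoff):
    """Return the index at which to cut the 3' end of a read.

    Bases at positions >= the returned index are removed.
    """
    running_sum = 0
    min_sum = 0
    cut = len(qualities)  # by default, nothing is trimmed
    # Walk from the 3' end towards the 5' end, summing (quality - cutoff).
    for i in reversed(range(len(qualities))):
        running_sum += qualities[i] - cutoff
        if running_sum > 0:
            # Stop early: the partial sum became positive.
            break
        if running_sum < min_sum:
            # New minimum of the partial sums: best trimming position so far.
            min_sum = running_sum
            cut = i
    return cut


qualities = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
cut = quality_trim_index(qualities, cutoff=10)
print(cut, qualities[:cut])
```

For the quality values from the example, the minimum partial sum (-25) is reached at index 4, so the first four bases are kept.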
Developing¶
The Cutadapt source code is on GitHub. Cutadapt is written in Python 3 with some extension modules that are written in Cython.
Development installation¶
For development, make sure that you install Cython and tox. We also recommend using a virtualenv. This sequence of commands should work:
git clone https://github.com/marcelm/cutadapt.git # or clone your own fork
cd cutadapt
virtualenv .venv
source .venv/bin/activate
pip install Cython pytest tox pre-commit
pre-commit install
pip install -e .
Then you should be able to run Cutadapt:
cutadapt --version
Remember that you do not need to activate a virtualenv to run binaries in it, so this works even when the environment is not activated:
.venv/bin/cutadapt --version
The tests can then be run like this:
pytest
Or with tox (but then you will need to have binaries for all tested Python versions installed):
tox
Making a release¶
A new release is automatically deployed to PyPI whenever a new tag is pushed to the Git repository.
Cutadapt uses setuptools_scm to automatically manage version numbers. This means that the version is not stored in the source code but derived from the most recent Git tag. The following procedure can be used to bump the version and make a new release.
- Update CHANGES.rst (version number and list of changes).
- Ensure you have no uncommitted changes in the working copy.
- Run a git pull.
- Run tox, ensuring all tests pass.
- Tag the current commit with the version number (there must be a v prefix):
  git tag v0.1
  To release a development version, use a dev version number such as v1.17.dev1. Users will not automatically get these unless they use pip install --pre.
- Push the tag:
  git push --tags
- Wait for the GitHub Action to finish and to deploy to PyPI.
- The bioconda recipe also needs to be updated, but the bioconda bot will likely do this automatically if you just wait a little while. Ensure that the list of dependencies (the requirements: section in the recipe) is in sync with the setup.cfg file.
If something went wrong after a version has already been tagged and published to PyPI, fix the problem and tag a new version. Do not change a version that has already been uploaded.
Contributing¶
Contributions to Cutadapt in the form of source code or documentation improvements or helping out with responding to issues are welcome!
To contribute to Cutadapt development, it is easiest to send in a pull request (PR) on GitHub.
Here are some guidelines for how to do this. They are not strict rules. When in doubt, send in a PR and we will sort it out.
- Limit a PR to a single topic. Submit multiple PRs if necessary. This way, it is easier to discuss the changes individually, and in case we find that one of them should not go in, the others can still be accepted.
- For larger changes, consider opening an issue first to plan what you want to do.
- Include appropriate unit or integration tests. Sometimes, tests are hard to write or don’t make sense. If you think this is the case, just leave the tests out initially and we can discuss whether to add any.
- Add documentation and a changelog entry if appropriate.
Code style¶
- The source code needs to be formatted with black. If you install pre-commit, the formatting will be done for you.
- There are inconsistencies in the current code base since it’s a few years old already. New code should follow the current rules, however.
- Prefer double quotation marks in new code. This will also make the diff smaller if we eventually switch to black.
- Using an IDE is beneficial (PyCharm, for example). It helps to catch lots of style issues early (unused imports, spacing etc.).
- Avoid unnecessary abbreviations for variable names. Code is more often read than written.
- When writing a help text for a new command-line option, look at the output of cutadapt --help and try to make it look nice and short.
- In comments and documentation, capitalize FASTQ, BWA, CPU etc.
Ideas/To Do¶
This is a rather unsorted list of features that would be nice to have, of things that could be improved in the source code, and of possible algorithmic improvements.
- show average error rate
- length histogram
- --detect prints out best guess which of the given adapters is the correct one
- warn when given adapter sequence contains non-IUPAC characters
Specifying adapters¶
Allow something such as -a ADAP$TER or -a ADAPTER$NNN. This would be a way to specify less strict anchoring.
Allow N{3,10} as in regular expressions (for a variable-length sequence).
Use parentheses to specify the part of the sequence that should be kept:
-a (...)ADAPTER (default)
-a (...ADAPTER) (default)
-a ADAPTER(...) (default)
-a (ADAPTER...) (??)
Or, specify the part that should be removed:
-a ...(ADAPTER...)
-a ...ADAPTER(...)
-a (ADAPTER)...
Available letters for command-line options¶
- Lowercase letters: i, k, s, w
- Uppercase letters: C, D, E, F, H, I, J, K, L, P, R, S, T, V, W
- Deprecated, could be re-used: c, d, t
- Planned/reserved: Q (paired-end quality trimming), V (alias for --version)
Changelog¶
v4.0 (2022-04-13)¶
- #604, #608: The alignment algorithm was tweaked to penalize indels more and to more accurately pick the leftmost adapter occurrence if there are multiple. This will normally affect very few reads, but should generally lead to fewer surprising results in cases where it matters. Because this changes trimming results, it was appropriate to bump the major version to 4.
- #607: Print an error when an output file was specified multiple times (for example, for --untrimmed-output and --too-short-output). Sending output from different filters to the same file is not supported at the moment.
- #603: When -e was used with an absolute number of errors and there were N wildcards in the sequence, the actual number of allowed errors was too low.
- Speed up quality trimming (both -q and --nextseq-trim) somewhat.
- Python 3.6 is no longer supported as it is end-of-life.
v3.7 (2022-02-23)¶
- #600: Fixed {match_sequence} placeholder not working when renaming paired-end reads.
v3.6 (2022-02-18)¶
- #437: Add {match_sequence} to the placeholders that --rename accepts. This allows adding the sequence matching an adapter (including errors) to the read header. An empty string is inserted if there is no match.
- #589: Windows wheels are now available on PyPI. That is, pip install will no longer attempt to compile things, but just install a pre-compiled binary.
- #592: Clarify in documentation and error messages that anchored adapters need to match in full and that therefore setting an explicit minimum overlap (min_overlap=, o=) for them is not possible.
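To illustrate the placeholder idea behind --rename, here is a minimal Python sketch of how such a template could be expanded. It is not Cutadapt's implementation: the function name expand_template and its arguments are hypothetical, and only the {id} and {match_sequence} placeholders are modeled.

```python
def expand_template(template: str, read_id: str, match_sequence: str = "") -> str:
    """Fill a rename template with values for one read.

    match_sequence defaults to the empty string, mirroring the rule that
    an empty string is inserted if there is no adapter match.
    """
    return template.format(id=read_id, match_sequence=match_sequence)


# A read with an adapter match, and one without.
print(expand_template("{id} adapter={match_sequence}", "read1", "ACGT"))
print(expand_template("{id} adapter={match_sequence}", "read2"))
```

On the command line, the corresponding template would be passed as the argument to --rename.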
v3.5 (2021-09-29)¶
- #555: Add support for dumping statistics in JSON format using --json.
- #541: Add a “Read fate breakdown” section heading to the report, and also add statistics for reads discarded because of --discard-untrimmed and --discard-trimmed. With this, the numbers in that section should add up to 100%.
- Add option -Q, which allows specifying a quality-trimming threshold for R2 that is different from the one for R1.
- #567: Add noindels adapter-trimming parameter. You can now write -a "ADAPTER;noindels" to disallow indels for a single adapter only.
- #570: Fix --pair-adapters not finding some pairs when reads contain more than one adapter.
- #524: Fix a memory leak when using --info-file with multiple cores.
- #559: Fix adjacent base statistics not being shown for linked adapters.
v3.4 (2021-03-30)¶
- #481: An experimental single-file Windows executable of Cutadapt is available for download on the GitHub “releases” page.
- #517: Report correct sequence in info file if read was reverse complemented
- #517: Added a column to the info file that shows whether the read was reverse-complemented (if --revcomp was used)
- #320: Fix (again) “Too many open files” when demultiplexing
v3.3 (2021-03-04)¶
- #504: Fix a crash on Windows.
- #490: When --rename is used with --revcomp, disable adding the rc suffix to reads that were reverse-complemented.
- Also, there is now a {rc} template variable for the --rename option, which is replaced with “rc” if the read was reverse-complemented (and the empty string if not).
- #512: Fix issue #128 once more (the “Reads written” figure in the report incorrectly included both trimmed and untrimmed reads if --untrimmed-output was used).
- #515: The report is now sent to stderr if any output file is written to stdout
v3.2 (2021-01-07)¶
- #437: Implement a --rename option for flexible read name modifications such as moving a barcode sequence into the read name.
- #503: The index for demultiplexing is now created a lot faster (within seconds instead of minutes) when allowing indels.
- #499: Fix combinatorial demultiplexing not working when using multiple cores.
v3.1 (2020-12-03)¶
- #443: With --action=retain, it is now possible to trim reads while leaving the adapter sequence itself in the read. That is, only the sequence before (for 5’ adapters) or after (for 3’ adapters) is removed. With linked adapters, both adapters are retained.
- #495: Running with multiple cores did not work using macOS and Python 3.8+. To prevent problems like these in the future, automated testing has been extended to also run on macOS.
- #482: Print statistics for --discard-casava and --max-ee in the report.
- #497: The changelog for 3.0 previously forgot to mention that the following options, which were deprecated in version 2.0, have now been removed, and using them will lead to an error: --format, --colorspace, -c, -d, --double-encode, -t, --trim-primer, --strip-f3, --maq, --bwa, --no-zero-cap. This frees up some single-character options, allowing them to be re-purposed for future Cutadapt features.
v3.0 (2020-11-10)¶
- Demultiplexing on multiple cores is now supported. This was the last feature that only ran single-threaded.
- #478: Demultiplexing now always generates all possible output files.
- #358: You can now use -e also to specify the maximum number of errors (instead of the maximum error rate). For example, write -e 2 to allow two errors over a full-length adapter match.
- #486: Trimming many anchored adapters (for example when demultiplexing) is now faster by using an index even when indels are allowed. Previously, Cutadapt would only be able to build an index with --no-indels.
- #469: Cutadapt did not run under Python 3.8 on recent macOS versions.
- #425: Change the default compression level for .gz output files from 6 to 5. This reduces the time used for compression by about 50% while increasing file size by less than 10%. To get the old behavior, use --compression-level=6. If you use Cutadapt to create intermediate files that are deleted anyway, consider also using the even faster option -Z (same as --compression-level=1).
- #485: Fix that, under some circumstances, in particular when trimming a 5’ adapter and there was a mismatch in its last nucleotide(s), not the entire adapter sequence would be trimmed from the read. Since fixing this required changing the alignment algorithm slightly, this is a backwards incompatible change.
- Fix that the report did not include the number of reads that are too long, too short or had too many N. (This unintentionally disappeared in a previous version.)
- #487: When demultiplexing, the reported number of written pairs was always zero.
- #497: The following options, which were deprecated in version 2.0, have been removed, and using them will lead to an error: --format, --colorspace, -c, -d, --double-encode, -t, --trim-primer, --strip-f3, --maq, --bwa, --no-zero-cap. This frees up some single-character options, allowing them to be re-purposed for future Cutadapt features.
- Ensure Cutadapt runs under Python 3.9.
- Drop support for Python 3.5.
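The two meanings of -e (#358) can be sketched in a few lines of Python. This is an illustration under assumptions, not Cutadapt's code: the helper name effective_error_rate is hypothetical, it assumes that values below 1 are read as an error rate and values of 1 or greater as an absolute number of errors for a full-length match, and it ignores N wildcards in the adapter.

```python
def effective_error_rate(e: float, adapter_length: int) -> float:
    """Convert an -e style value into an error rate.

    Assumption: e < 1 is already a rate; e >= 1 is an absolute number
    of errors allowed over a full-length adapter match.
    """
    if e < 0:
        raise ValueError("e must not be negative")
    if e >= 1:
        return e / adapter_length
    return e


# For a 20 nt adapter, "-e 2" and "-e 0.1" describe the same tolerance.
print(effective_error_rate(2, 20))
print(effective_error_rate(0.1, 20))
```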
v2.10 (2020-04-22)¶
- Fixed a performance regression introduced in version 2.9.
- #449: --action= could not be used with --pair-adapters. Fix contributed by wlokhorst.
- #450: --untrimmed-output, --too-short-output and --too-long-output can now be written interleaved.
- #453: Fix problem that N wildcards in adapters did not match N characters in the read. N characters now match any character in the read, independent of whether --match-read-wildcards is used or not.
- With --action=lowercase/mask, print which sequences would have been removed in the “Overview of removed sequences” statistics. Previously, it would show that no sequences have been removed.
v2.9 (2020-03-18)¶
- #441: Add a --max-ee (or --max-expected-errors) option for filtering reads whose number of expected errors exceeds the given threshold. The idea comes from Edgar et al. (2015).
- #438: The info file now contains the rc suffix that is added to the names of reverse-complemented reads (with --revcomp).
- #448: .bz2 and .xz output wasn’t possible in multi-core mode.
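The expected-errors filter can be illustrated with a short Python sketch. It uses the standard Phred relation between a quality value Q and an error probability, p = 10^(-Q/10), and sums these probabilities over the read. The function names are hypothetical and the code is not taken from Cutadapt; it assumes qualities encoded as ascii(quality+33).

```python
def expected_errors(qualities: str, quality_base: int = 33) -> float:
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-(ord(char) - quality_base) / 10) for char in qualities)


def passes_max_ee(qualities: str, max_ee: float) -> bool:
    """Keep a read unless its expected errors exceed the threshold."""
    return expected_errors(qualities) <= max_ee


# "+" encodes Q10 (error probability 0.1), "I" encodes Q40 (0.0001).
print(expected_errors("++"))
print(passes_max_ee("+" * 20, max_ee=1.0))
```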
v2.8 (2020-01-13)¶
- #220: With option --revcomp, Cutadapt now searches both the read and its reverse complement for adapters. The version that matches best is kept. This can be used to “normalize” strandedness.
- #430: --action=lowercase now works with linked adapters
- #431: Info files can now be written even for linked adapters.
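As a rough illustration of the --revcomp idea, the following Python sketch computes a reverse complement and keeps whichever orientation contains the adapter. It is a deliberate simplification: Cutadapt uses error-tolerant alignment and keeps the better match, whereas this sketch (with the hypothetical helper normalize_strand) only checks for an exact substring.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse; other characters (N) are kept."""
    return sequence.translate(COMPLEMENT)[::-1]


def normalize_strand(read: str, adapter: str):
    """Return (sequence, was_reverse_complemented).

    Simplification: exact substring search instead of alignment.
    """
    if adapter in read:
        return read, False
    flipped = reverse_complement(read)
    if adapter in flipped:
        return flipped, True
    return read, False


print(reverse_complement("AACG"))
print(normalize_strand("TTTTACGG", "CCGT"))
```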
v2.7 (2019-11-22)¶
- #427: Multicore is now supported even when using --info-file, --rest-file or --wildcard-file. The only remaining feature that still does not work with multicore is now demultiplexing.
- #290: When running on a single core, Cutadapt no longer spawns external pigz processes for writing gzip-compressed files. This is a first step towards ensuring that using --cores=n uses only at most n CPU cores.
- This release adds support for Python 3.8.
v2.6 (2019-10-26)¶
- #395: Do not show animated progress when --quiet is used.
- #399: When two adapters align to a read equally well (in terms of the number of matches), prefer the alignment that has fewer errors.
- #401: Give priority to adapters given earlier on the command line. Previously, the priority was: All 3’ adapters, all 5’ adapters, all anywhere adapters. In rare cases this could lead to different results.
- #404: Fix an issue preventing Cutadapt from being used on Windows.
- This release no longer supports Python 3.4 (which has reached end of life).
v2.5 (2019-09-04)¶
- #391: Multicore is now supported even when using --untrimmed-output, --too-short-output, --too-long-output or the corresponding ...-paired-output options.
- #393: Using --info-file no longer crashes when processing paired-end data. However, the info file itself will only contain results for R1.
- #394: Options -e/--no-indels/-O were ignored for linked adapters
- #320: When a “Too many open files” error occurs during demultiplexing, Cutadapt can now automatically raise the limit and re-try if the limit is a “soft” limit.
v2.4 (2019-07-09)¶
- #292: Implement support for demultiplexing paired-end reads that use combinatorial indexing (“combinatorial demultiplexing”).
- #384: Speed up reading compressed files by requiring an xopen version that uses an external pigz process even for reading compressed input files (not only for writing).
- #381: Fix --report=minimal not working.
- #380: Add a --fasta option for forcing that FASTA is written to standard output even when input is FASTQ. Previously, forcing FASTA was only possible by providing an output file name.
v2.3 (2019-04-25)¶
- #378: The --pair-adapters option, added in version 2.1, was not actually usable for demultiplexing.
v2.2 (2019-04-20)¶
v2.1 (2019-03-15)¶
- #366: Fix problems when combining --cores with reading from standard input or writing to standard output.
- #347: Support “paired adapters”. One use case is demultiplexing Illumina Unique Dual Indices (UDI).
v2.0 (2019-03-06)¶
This is a major new release with lots of bug fixes and new features, but also some backwards-incompatible changes. These should hopefully not affect too many users, but please make sure to review them and possibly update your scripts!
Backwards-incompatible changes¶
- #329: Linked adapters specified with -a ADAPTER1...ADAPTER2 are no longer anchored by default. To get results consistent with the old behavior, use -a ^ADAPTER1...ADAPTER2 instead.
- Support for colorspace data was removed. Thus, the following command-line options can no longer be used: -c, -d, -t, --strip-f3, --maq, --bwa, --no-zero-cap.
- “Legacy mode” has been removed. This mode was enabled under certain conditions and would change the behavior such that the read-modifying options such as -q would only apply to the forward/R1 reads. This was necessary for compatibility with old Cutadapt versions, but became increasingly confusing.
- #360: Computation of the error rate of an adapter match no longer counts the N wildcard bases. Previously, an adapter like N{18}CC (18 N wildcards followed by CC) would effectively match anywhere because the default error rate of 0.1 (10%) would allow for two errors. The error rate of a match is now computed as the number of errors divided by the number of non-N bases in the matching part of the adapter.
- This release of Cutadapt requires at least Python 3.4 to run. Python 2.7 is no longer supported.
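The effect of change #360 on the N{18}CC example can be worked through in a few lines of Python. This is only an arithmetic sketch with a hypothetical helper name (allowed_errors); it assumes a full-length match and that the number of allowed errors is the error rate times the counted length, rounded down.

```python
def allowed_errors(adapter: str, max_error_rate: float = 0.1,
                   count_wildcards: bool = False) -> int:
    """Errors allowed over a full-length match of the adapter.

    count_wildcards=True mimics the old behavior (N bases counted),
    count_wildcards=False the new one (only non-N bases counted).
    """
    if count_wildcards:
        length = len(adapter)
    else:
        length = sum(1 for base in adapter if base.upper() != "N")
    return int(max_error_rate * length)


adapter = "N" * 18 + "CC"
print(allowed_errors(adapter, count_wildcards=True))   # old: 20 bases counted
print(allowed_errors(adapter, count_wildcards=False))  # new: 2 bases counted
```

With the old computation, both non-wildcard bases could be mismatches, which is why the adapter matched almost anywhere.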
Features¶
- A progress indicator is printed while Cutadapt is working. If you redirect standard error to a file, the indicator is disabled.
- Reading of FASTQ files has gotten faster due to a new parser. The FASTA and FASTQ reading/writing functions are now available as part of the dnaio library. This is a separate Python package that can be installed independently from Cutadapt. There is one regression at the moment: FASTQ files that use a second header (after the “+”) will have that header removed in the output.
- Some other performance optimizations were made. Speedups of up to 15% are possible.
- Demultiplexing has become a lot faster under certain conditions.
- #335: For linked adapters, it is now possible to specify which of the two adapters should be required, overriding the default.
- #166: By specifying --action=lowercase, it is now possible to not trim adapters, but to instead convert the section of the read that would have been trimmed to lowercase.
Bug fixes¶
- Removal of legacy mode fixes also #345: --length would not enable legacy mode.
- The switch to dnaio also fixed #275: Input files with non-standard names now no longer lead to a crash. Instead the format is now recognized from the file content.
- Fix #354: Sequences given using file: can now be unnamed.
- Fix #257 and #242: When only R1 or only R2 adapters are given, the --pair-filter setting is now forced to both for the --discard-untrimmed (and --untrimmed-(paired-)output) filters. Otherwise, with the default --pair-filter=any, all pairs would be considered untrimmed because one of the reads in the pair is always untrimmed.
v1.18 (2018-09-07)¶
Features¶
- Close #327: Maximum and minimum lengths can now be specified separately for R1 and R2 with -m LENGTH1:LENGTH2. One of the lengths can be omitted, in which case only the length of the other read is checked (as in -m 17: or -m :17).
- Close #322: Use -j 0 to auto-detect how many cores to run on. This should even work correctly on cluster systems when Cutadapt runs as a batch job to which fewer cores than exist on the machine have been assigned. Note that the number of threads used by pigz cannot be controlled at the moment, see #290.
- Close #225: Allow setting the maximum error rate and minimum overlap length per adapter. A new syntax for adapter-specific parameters was added for this. Example: -a "ADAPTER;min_overlap=5".
- Close #152: Using the new syntax for adapter-specific parameters, it is now possible to allow partial matches of a 3’ adapter at the 5’ end (and partial matches of a 5’ adapter at the 3’ end) by specifying the anywhere parameter (as in -a "ADAPTER;anywhere").
- Allow --pair-filter=first in addition to both and any. If used, a read pair is discarded if the filtering criterion applies to R1; and R2 is ignored.
- Close #112: Implement a --report=minimal option for printing a succinct two-line report in tab-separated value (tsv) format. Thanks to @jvolkening for coming up with an initial patch!
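The LENGTH1:LENGTH2 notation from #327 can be sketched as a small parser in Python. The function name parse_pair_lengths is hypothetical and this is not Cutadapt's own parsing code. That a value without a colon applies to both reads is an assumption made for this sketch.

```python
from typing import Optional, Tuple


def parse_pair_lengths(spec: str) -> Tuple[Optional[int], Optional[int]]:
    """Parse "LENGTH1:LENGTH2" into per-read limits.

    An omitted length becomes None, meaning that read is not checked.
    Assumption: a plain number (no colon) applies to both reads.
    """
    if ":" not in spec:
        length = int(spec)
        return length, length
    first, second = spec.split(":", 1)
    return (int(first) if first else None,
            int(second) if second else None)


print(parse_pair_lengths("17:"))    # only R1 is checked
print(parse_pair_lengths(":17"))    # only R2 is checked
print(parse_pair_lengths("20:30"))  # separate limits for R1 and R2
```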
Bug fixes¶
- Fix #128: The “Reads written” figure in the report incorrectly included both trimmed and untrimmed reads if --untrimmed-output was used.
Other¶
- The options --no-trim and --mask-adapter should now be written as --action=none and --action=mask, respectively. The old options still work.
- This is the last release to support colorspace data
- This is the last release to support Python 2.
v1.17 (2018-08-20)¶
- Close #53: Implement adapters that disallow internal matches. This is a bit like anchoring, but less strict: The adapter sequence can appear at different lengths, but must always be at one of the ends. Use -a ADAPTERX (with a literal X) to disallow internal matches for a 3’ adapter. Use -g XADAPTER to disallow for a 5’ adapter.
- @klugem contributed PR #299: The --length option (and its alias -l) can now be used with negative lengths, which will remove bases from the beginning of the read instead of from the end.
- Close #107: Add a --discard-casava option to remove reads that did not pass CASAVA filtering (this is possibly relevant only for older datasets).
- Fix #318: Cutadapt should now be installable with Python 3.7.
- Running Cutadapt under Python 3.3 is no longer supported (Python 2.7 or 3.4+ are needed)
- Planned change: One of the next Cutadapt versions will drop support for Python 2 entirely, requiring Python 3.
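The behavior of --length with positive and negative values (PR #299) maps directly onto Python slicing. The helper name shorten is hypothetical and the sketch only covers the sequence string, not quality values.

```python
def shorten(sequence: str, length: int) -> str:
    """Shorten a read to abs(length) bases.

    A positive length removes bases from the end of the read,
    a negative length removes them from the beginning.
    """
    if length >= 0:
        return sequence[:length]
    return sequence[length:]


read = "ACGTACGTAC"
print(shorten(read, 4))   # keeps the first four bases
print(shorten(read, -4))  # keeps the last four bases
```

Reads that are already shorter than the requested length are returned unchanged in both cases.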
v1.16 (2018-02-21)¶
v1.15 (2017-11-23)¶
- Cutadapt can now run on multiple CPU cores in parallel! To enable it, use the option -j N (or the long form --cores=N), where N is the number of cores to use. Multi-core support is only available on Python 3, and not yet with some command-line arguments. See the new section about multi-core in the documentation for details. When writing .gz files, make sure you have pigz installed to get the best speedup.
- The plan is to make multi-core the default (automatically using as many cores as are available) in future releases, so please test it and report an issue if you find problems!
- Issue #256: --discard-untrimmed did not have an effect on non-anchored linked adapters.
- Issue #118: Added support for demultiplexing of paired-end data.
v1.14 (2017-06-16)¶
- Fix: Statistics for 3’ part of a linked adapter were reported incorrectly
- Fix issue #244: Quality trimming with --nextseq-trim would not apply to R2 when trimming paired-end reads.
- --nextseq-trim now disables legacy mode.
- Fix issue #246: installation failed on non-UTF8 locale
v1.13 (2017-03-16)¶
- The 3’ adapter of linked adapters can now be anchored. Write -a ADAPTER1...ADAPTER2$ to enable this. Note that the 5’ adapter is always anchored in this notation.
- Issue #224: If you want the 5’ part of a linked adapter not to be anchored, you can now write -g ADAPTER...ADAPTER2 (note -g instead of -a). This feature is experimental and may change behavior in the next release.
- Issue #236: For more accurate statistics, it is now possible to specify the GC content of the input reads with --gc-content. This does not change trimming results, only the number in the “expect” column of the report. Since this is probably not needed by many people, the option is not listed when running cutadapt --help.
- Issue #235: Adapter sequences are now required to contain only valid IUPAC codes (lowercase is also allowed, U is an alias for T). This should help to catch hard-to-find bugs, especially in scripts. Use option -N to match characters literally (possibly useful for amino acid sequences).
- Documentation updates and some refactoring of the code
v1.12 (2016-11-28)¶
- Add read modification option --length (short: -l), which will shorten each read to the given length.
- Cutadapt will no longer complain that it has nothing to do when you do not give it any adapters. For example, you can use this to convert file formats: cutadapt -o output.fasta input.fastq.gz converts FASTQ to FASTA.
- The xopen module for opening compressed files was moved to a separate package on PyPI.
v1.11 (2016-08-16)¶
- The --interleaved option no longer requires that both input and output is interleaved. It is now possible to have two-file input and interleaved output, and to have interleaved input and two-file output.
- Fix issue #202: First and second FASTQ header could get out of sync when options modifying the read name were used.
v1.10 (2016-05-19)¶
- Added a new “linked adapter” type, which can be used to search for a 5’ and a 3’ adapter at the same time. Use -a ADAPTER1...ADAPTER2 to search for a linked adapter. ADAPTER1 is interpreted as an anchored 5’ adapter, which is searched for first. Only if ADAPTER1 is found will ADAPTER2 be searched for, which is a regular 3’ adapter.
- Added experimental --nextseq-trim option for quality trimming of NextSeq data. This is necessary because that machine cannot distinguish between G and reaching the end of the fragment (it encodes G as ‘black’).
- Even when trimming FASTQ files, output can now be FASTA (quality values are simply dropped). Use the -o/-p options with a file name that ends in .fasta or .fa to enable this.
- Cutadapt does not bundle pre-compiled C extension modules (.so files) anymore. This affects only users that run cutadapt directly from an unpacked tarball. Install through pip or conda instead.
- Fix issue #167: Option --quiet was not entirely quiet.
- Fix issue #199: Be less strict when checking for properly-paired reads.
- This is the last version of cutadapt to support Python 2.6. Future versions will require at least Python 2.7.
v1.9.1 (2015-12-02)¶
- Added --pair-filter option, which modifies how filtering criteria apply to paired-end reads
- Add --too-short-paired-output and --too-long-paired-output options.
- Fix incorrect number of trimmed bases reported if --times option was used.
v1.9 (2015-10-29)¶
- Indels in the alignment can now be disabled for all adapter types (use --no-indels).
- Quality values are now printed in the info file (--info-file) when trimming FASTQ files. Fixes issue #144.
- Options --prefix and --suffix, which modify read names, now accept the placeholder {name} and will replace it with the name of the found adapter. Fixes issue #104.
- Interleaved FASTQ files: With the --interleaved switch, paired-end reads will be read from and written to interleaved FASTQ files. Fixes issue #113.
- Anchored 5’ adapters can now be specified by writing -a SEQUENCE... (note the three dots).
- Fix --discard-untrimmed and --discard-trimmed not working as expected in paired-end mode (issue #146).
- The minimum overlap is now automatically reduced to the adapter length if it is too large. Fixes part of issue #153.
- Thanks to Wolfgang Gerlach, there is now a Dockerfile.
- The new --debug switch makes cutadapt print out the alignment matrix.
v1.8.3 (2015-07-29)¶
- Fix issue #95: Untrimmed reads were not listed in the info file.
- Fix issue #138: pip install cutadapt did not work with new setuptools versions.
- Fix issue #137: Avoid a hang when writing to two or more gzip-compressed output files in Python 2.6.
v1.8.2 (2015-07-24)¶
v1.8.1 (2015-04-09)¶
- Fix #110: Counts for ‘too short’ and ‘too long’ reads were swapped in statistics.
- Fix #115: Make --trim-n work also on second read for paired-end data.
v1.8 (2015-03-14)¶
- Support single-pass paired-end trimming with the new -A/-G/-B/-U parameters. These work just like their -a/-g/-b/-u counterparts, but they specify sequences that are removed from the second read in a pair. Also, if you start using one of those options, the read modification options such as -q (quality trimming) are applied to both reads. For backwards compatibility, read modifications are applied to the first read only if none of -A/-G/-B/-U is used. See the documentation for details. This feature has not been extensively tested, so please give feedback if something does not work.
- The report output has been re-worked in order to accommodate the new paired-end trimming mode. This also changes how the report looks in single-end mode. It is hopefully now more accessible.
- Chris Mitchell contributed a patch adding two new options: --trim-n removes any N bases from the read ends, and the --max-n option can be used to filter out reads with too many N.
- Support notation for repeated bases in the adapter sequence: Write A{10} instead of AAAAAAAAAA. Useful for poly-A trimming: Use -a A{100} to get the longest possible tail.
- Quality trimming at the 5’ end of reads is now supported. Use -q 15,10 to trim the 5’ end with a cutoff of 15 and the 3’ end with a cutoff of 10.
- Fix incorrectly reported statistics (> 100% trimmed bases) when --times is set to a value greater than one.
- Support .xz-compressed files (if running in Python 3.3 or later).
- Started to use the GitHub issue tracker instead of Google Code. All old issues have been moved.
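The repeat notation for adapter sequences (A{10} instead of AAAAAAAAAA) can be expanded with a single regular-expression substitution. The following Python sketch uses a hypothetical helper name, expand_repeats, and is not taken from Cutadapt's source.

```python
import re


def expand_repeats(adapter: str) -> str:
    """Expand X{n} into n copies of the character X."""
    return re.sub(
        r"(.)\{(\d+)\}",
        lambda match: match.group(1) * int(match.group(2)),
        adapter,
    )


print(expand_repeats("A{10}"))
print(expand_repeats("ACG{3}T"))
```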
v1.7 (2014-11-25)¶
- IUPAC characters are now supported. For example, use -a YACGT for an adapter that matches both CACGT and TACGT with zero errors. Disable with -N. By default, IUPAC characters in the read are not interpreted in order to avoid matches in reads that consist of many (low-quality) N bases. Use --match-read-wildcards to enable them also in the read.
- Support for demultiplexing was added. This means that reads can be written to different files depending on which adapter was found. See the section in the documentation for how to use it. This is currently only supported for single-end reads.
- Add support for anchored 3’ adapters. Append $ to the adapter sequence to force the adapter to appear in the end of the read (as a suffix). Closes issue #81.
- Option --cut (-u) can now be specified twice, once for each end of the read. Thanks to Rasmus Borup Hansen for the patch!
- Options --minimum-length/--maximum-length (-m/-M) can be used standalone. That is, cutadapt can be used to filter reads by length without trimming adapters.
- Fix bug: Adapters read from a FASTA file can now be anchored.
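How an IUPAC wildcard such as Y can match more than one base is easy to show in Python. The sketch below compares an adapter and a read of equal length position by position, without allowing errors or indels, so it illustrates only the wildcard lookup and not Cutadapt's alignment. The names IUPAC and matches are chosen for this example.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(adapter: str, read: str) -> bool:
    """True if every read base is allowed by the adapter code at that position.

    Wildcards are interpreted in the adapter only, not in the read.
    """
    if len(adapter) != len(read):
        return False
    return all(base in IUPAC[code]
               for code, base in zip(adapter.upper(), read.upper()))


print(matches("YACGT", "CACGT"))
print(matches("YACGT", "TACGT"))
print(matches("YACGT", "GACGT"))
```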
v1.6 (2014-10-07)¶
- Fix bug: Ensure --format=... can be used even with paired-end input.
- Fix bug: Sometimes output files would be incomplete because they were not closed correctly.
- Alignment algorithm is a tiny bit faster.
- Extensive work on the documentation. It’s now available at https://cutadapt.readthedocs.org/ .
- For 3’ adapters, statistics about the bases preceding the trimmed adapter are collected and printed. If one of the bases is overrepresented, a warning is shown since this points to an incomplete adapter sequence. This happens, for example, when a TruSeq adapter is used but the A overhang is not taken into account when running cutadapt.
- Due to code cleanup, there is a change in behavior: If you use --discard-trimmed or --discard-untrimmed in combination with --too-short-output or --too-long-output, then cutadapt now writes also the discarded reads to the output files given by the --too-short or --too-long options. If anyone complains, I will consider reverting this.
- Galaxy support files are now in a separate repository.
v1.5 (2014-08-05)¶
- Adapter sequences can now be read from a FASTA file. For example, write -a file:adapters.fasta to read 3’ adapters from adapters.fasta. This works also for -b and -g.
- Add the option --mask-adapter, which can be used to not remove adapters, but to instead mask them with N characters. Thanks to Vittorio Zamboni for contributing this feature!
- U characters in the adapter sequence are automatically converted to T.
- Do not run Cython at installation time unless the --cython option is provided.
- Add the option -u/--cut, which can be used to unconditionally remove a number of bases from the beginning or end of each read.
- Make --zero-cap the default for colorspace reads.
- When the new option --quiet is used, no report is printed after all reads have been processed.
- When processing paired-end reads, cutadapt now checks whether the reads are properly paired.
- To properly handle paired-end reads, an option --untrimmed-paired-output was added.
v1.4 (2014-03-13)¶
- This release of cutadapt reduces the overhead of reading and writing files. On my test data set, a typical run of cutadapt (with a single adapter) takes 40% less time due to the following two changes.
- Reading and writing of FASTQ files is faster (thanks to Cython).
- Reading and writing of gzipped files is faster (up to 2x) on systems where the gzip program is available.
- The quality trimming function is four times faster (also due to Cython).
- Fix the statistics output for 3’ colorspace adapters: The reported lengths were one too short. Thanks to Frank Wessely for reporting this.
- Support the --no-indels option. This disallows insertions and deletions while aligning the adapter. Currently, the option is only available for anchored 5’ adapters. This fixes issue 69.
- As a side effect of implementing the --no-indels option: For colorspace, the length of a read (for --minimum-length and --maximum-length) is now computed after primer base removal (when --trim-primer is specified).
- Added one column to the info file that contains the name of the found adapter.
- Add an explanation about colorspace ambiguity to the README
v1.3 (2013-11-08)¶
- Preliminary paired-end support with the --paired-output option (contributed by James Casbon). See the README section on how to use it.
- Improved statistics.
- Fix incorrectly reported amount of quality-trimmed Mbp (issue 57, fix by Chris Penkett)
- Add the --too-long-output option.
- Add the --no-trim option, contributed by Dave Lawrence.
- Port handwritten C alignment module to Cython.
- Fix the --rest-file option (issue 56)
- Slightly speed up alignment of 5’ adapters.
- Support bzip2-compressed files.
v1.2 (2012-11-30)¶
- At least 25% faster processing of .csfasta/.qual files due to faster parser.
- Between 10% and 30% faster writing of gzip-compressed output files.
- Support 5’ adapters in colorspace, even when no primer trimming is requested.
- Add the --info-file option, which has a line for each found adapter.
- Named adapters are possible. Usage: -a My_Adapter=ACCGTA assigns the name “My_Adapter”.
- Improve alignment algorithm for better poly-A trimming when there are sequencing errors. Previously, not the longest possible poly-A tail would be trimmed.
- James Casbon contributed the --discard-untrimmed option.
v1.1 (2012-06-18)¶
- Allow to “anchor” 5’ adapters (-g), forcing them to be a prefix of the read. To use this, add the special character ^ to the beginning of the adapter sequence.
- Add the “-N” option, which allows ‘N’ characters within adapters to match literally.
- Speedup of approx. 25% when reading from .gz files and using Python 2.7.
- Allow to only trim qualities when no adapter is given on the command-line.
- Add a patch by James Casbon: include read names (ids) in rest file
- Use nosetest for testing. To run, install nose and run “nosetests”.
- When using cutadapt without installing it, you now need to run bin/cutadapt due to a new directory layout.
- Allow to give a colorspace adapter in basespace (gets automatically converted).
- Allow to search for 5’ adapters (those specified with -g) in colorspace.
- Speed up the alignment by a factor of at least 3 by using Ukkonen’s algorithm. The total runtime decreases by about 30% in the tested cases.
- Allow to deal with colorspace FASTQ files from the SRA that contain a fake additional quality in the beginning (use --format sra-fastq)
v1.0 (2011-11-04)¶
- ASCII-encoded quality values were assumed to be encoded as ascii(quality+33).
With the new parameter
--quality-base
, this can be changed to ascii(quality+64), as used in some versions of the Illumina pipeline. (Fixes issue 7.) - Allow to specify that adapters were ligated to the 5’ end of reads. This change is based on a patch contributed by James Casbon.
- Due to cutadapt being published in EMBnet.journal, I found it appropriate to call this release version 1.0. Please see http://journal.embnet.org/index.php/embnetjournal/article/view/200 for the article and I would be glad if you cite it.
- Add Galaxy support, contributed by Lance Parsons.
- Patch by James Casbon: Allow N wildcards in read or adapter or both. Wildcard matching of ‘N’s in the adapter is always done. If ‘N’s within reads should also match without counting as error, this needs to be explicitly requested via --match-read-wildcards.
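The two encodings mentioned for --quality-base differ only in the offset that is added to a quality value before it is written as an ASCII character. The following is a minimal Python sketch of the decoding step, not cutadapt's own code; the function name decode_qualities is hypothetical.

```python
def decode_qualities(quality_string, base=33):
    """Convert an ASCII-encoded quality string into a list of integers.

    base=33 corresponds to ascii(quality+33), base=64 to ascii(quality+64).
    The function name is made up for this illustration.
    """
    return [ord(char) - base for char in quality_string]


# The same quality values (40, 40, 20, 0) in both encodings:
print(decode_qualities("II5!", base=33))
print(decode_qualities("hhT@", base=64))
```

Decoding a file with the wrong base shifts every quality value by 31, which is why the base has to be stated when the input does not use the default encoding.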
v0.9.5 (2011-07-20)¶
- Fix issue 20: Make the report go to standard output when -o/--output is specified.
- Recognize .fq as an extension for FASTQ files.
- Many more unit tests.
- The alignment algorithm has changed. It will now find some adapters that previously were missed. Note that this will produce different output than older cutadapt versions!
  Before this change, finding an adapter would work as follows:
  - Find an alignment between adapter and read – longer alignments are better.
  - If the number of errors in the alignment (divided by length) is above the maximum error rate, report the adapter as not being found.
  Sometimes, the long alignment that was found had too many errors, but a shorter alignment would not have. The adapter was then incorrectly reported as “not found”. The new alignment algorithm checks the error rate while aligning and only reports alignments that do not have too many errors.
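The difference between the two strategies can be illustrated with a deliberately simplified Python sketch. It is not cutadapt's alignment code: it assumes ungapped placements only (mismatches, no insertions or deletions), scores an alignment by its number of matching characters, and all function names are hypothetical.

```python
def candidates(adapter, read):
    """Yield (start, length, errors) for every ungapped placement of the
    adapter in the read; the adapter may run past the 3' end of the read."""
    for start in range(len(read)):
        length = min(len(adapter), len(read) - start)
        errors = sum(a != r for a, r in zip(adapter, read[start:start + length]))
        yield start, length, errors


def matches(candidate):
    """Number of matching characters; longer good alignments score higher."""
    _, length, errors = candidate
    return length - errors


def find_old(adapter, read, max_error_rate):
    """Old behavior (simplified): pick the best-scoring alignment first,
    then reject it if its error rate is too high."""
    start, length, errors = max(candidates(adapter, read), key=matches)
    return start if errors / length <= max_error_rate else None


def find_new(adapter, read, max_error_rate):
    """New behavior (simplified): only alignments within the error rate
    are considered at all; the best-scoring of those is reported."""
    allowed = [c for c in candidates(adapter, read)
               if c[2] / c[1] <= max_error_rate]
    if not allowed:
        return None
    return max(allowed, key=matches)[0]


adapter = "ACGTACGT"
# Position 2 holds a full-length copy with two mismatches (error rate 0.25);
# the last four bases are an error-free partial copy of the adapter.
read = "GGACCTACTTGGGGACGT"
print(find_old(adapter, read, 0.1))  # the long alignment wins, then fails the check
print(find_new(adapter, read, 0.1))  # the short error-free alignment is reported
```

In this constructed read, the old strategy settles on the long alignment at position 2 and then discards it, so the adapter counts as not found, while the new strategy reports the error-free partial occurrence at position 14.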
v0.9.4 (2011-05-20)¶
- now compatible with Python 3
- Add the --zero-cap option, which changes negative quality values to zero. This is a workaround to avoid segmentation faults in BWA. The option is now enabled by default when --bwa/--maq is used.
- Lots of unit tests added. Run them with cd tests && ./tests.sh.
- Fix issue 16: --discard-trimmed did not work.
- Allow to override auto-detection of input file format with the new -f/--format parameter. This mostly fixes issue 12.
- Don’t break when input file is empty.
v0.9.2 (2011-03-16)¶
- Install a single cutadapt Python package instead of multiple Python modules. This avoids cluttering the global namespace and should lead to fewer problems with other Python modules. Thanks to Steve Lianoglou for pointing this out to me!
- ignore case (ACGT vs acgt) when comparing the adapter with the read sequence
- .FASTA/.QUAL files (not necessarily colorspace) can now be read (some 454 software uses this format)
- Move some functions into their own modules
- lots of refactoring: replace the fasta module with a much nicer seqio module.
- allow to input FASTA/FASTQ on standard input (also FASTA/FASTQ is autodetected)
v0.9 (2011-01-10)¶
- add --too-short-output and --untrimmed-output, based on patch by Paul Ryvkin (thanks!)
- add --maximum-length parameter: discard reads longer than a specified length
- group options by category in --help output
- add --length-tag option. allows to fix read length in FASTA/Q comment lines (e.g., length=123 becomes length=58 after trimming) (requested by Paul Ryvkin)
- add -q/--quality-cutoff option for trimming low-quality ends (uses the same algorithm as BWA)
- some refactoring
- the filename - is now interpreted as standard input or standard output
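The -q/--quality-cutoff entry above refers to the quality trimming algorithm used by BWA. As that algorithm is usually described, the cutoff is subtracted from every quality value, partial sums are computed from each position to the end of the read, and the read is cut at the position where that sum is smallest. The Python sketch below illustrates this for the 3’ end only. It assumes the qualities are already decoded to integers, the function name quality_trim_index is hypothetical, and it is not taken from cutadapt's source.

```python
def quality_trim_index(qualities, cutoff):
    """Return the position at which to cut the 3' end of a read.

    qualities: list of integer quality values (already decoded)
    cutoff:    quality threshold
    Everything from the returned index onwards would be removed.
    """
    running_sum = 0
    minimum = 0
    stop = len(qualities)
    # Walk from the 3' end towards the 5' end.
    for i in reversed(range(len(qualities))):
        running_sum += qualities[i] - cutoff
        if running_sum > 0:
            # Stop scanning once the partial sum becomes positive.
            break
        if running_sum < minimum:
            minimum = running_sum
            stop = i
    return stop


qualities = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
cut = quality_trim_index(qualities, cutoff=10)
print(qualities[:cut])  # → [42, 40, 26, 27]
```

Note that the single value 11 above the cutoff does not stop the trimming, because the partial sum from that position to the end is still negative.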
v0.8 (2010-12-08)¶
- Change default behavior of searching for an adapter: The adapter is now assumed to be an adapter that has been ligated to the 3’ end. This should be the correct behavior for at least the SOLiD small RNA protocol (SREK) and also for the Illumina protocol. To get the old behavior, which used a heuristic to determine whether the adapter was ligated to the 5’ or 3’ end and then trimmed the read accordingly, use the new -b (--anywhere) option.
- Clear up how the statistics after processing all reads are printed.
- Fix incorrect statistics. Adapters starting at pos. 0 were correctly trimmed, but not counted.
- Modify scoring scheme: Improves trimming (some reads that should have been trimmed were not). Increases no. of trimmed reads in one of our SOLiD data sets from 36.5 to 37.6%.
- Speed improvements (20% less runtime on my test data set).
v0.7 (2010-12-03)¶
- Useful exit codes
- Better error reporting when malformed files are encountered
- Add --minimum-length parameter for discarding reads that are shorter than a specified length after trimming.
- Generalize the alignment function a bit. This is preparation for supporting adapters that are specific to either the 5’ or 3’ end.
- pure Python fallback for alignment function for when the C module cannot be used.
v0.6 (2010-11-18)¶
- Support gzipped input and output.
- Print timing information in statistics.
v0.5 (2010-11-17)¶
- add --discard option which makes cutadapt discard reads in which an adapter occurs
v0.4 (2010-11-17)¶
- (more) correctly deal with multiple adapters: If a long adapter matches with lots of errors, then this could lead to a shorter adapter matching with few errors getting ignored.
v0.3 (2010-09-27)¶
- fix huge memory usage (entire input file was unintentionally read into memory)
v0.2 (2010-09-14)¶
- allow FASTQ input
v0.1 (2010-09-14)¶
- initial release