Ideas/To Do
This is a rather unsorted list of features that would be nice to have, of things that could be improved in the source code, and of possible algorithmic improvements.
- show average error rate
- In colorspace and probably also for Illumina data, gapped alignment is not necessary
- --progress
- run pylint, pychecker
- length histogram
- check whether input is FASTQ although -f fasta is given
- search for adapters in the order in which they are given on the command line
- more tests for the alignment algorithm
- deprecate --rest-file
- --detect prints out the best guess as to which of the given adapters is the correct one
- alignment algorithm: make a ‘banded’ version
- it seems the str.find optimization isn’t very helpful. In any case, it should be moved into the Aligner class.
- allow removing the sequence before or after the adapter instead of the adapter itself
- convert adapter to lowercase
- warn when given adapter sequence contains non-IUPAC characters
- try multithreading again, this time use os.pipe()
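The non-IUPAC warning from the list above could look roughly like the following. This is only a sketch: the helper names `non_iupac_characters` and `warn_if_non_iupac` are hypothetical and not part of the existing code, and the check assumes a plain adapter sequence without any anchoring or wildcard syntax.

```python
import warnings

# IUPAC nucleotide codes (including U and the ambiguity codes).
IUPAC_NUCLEOTIDES = frozenset("ACGTURYSWKMBDHVN")


def non_iupac_characters(adapter):
    """Return the sorted list of characters in adapter that are not IUPAC codes.

    The comparison is case-insensitive, so lowercase adapters are accepted.
    """
    return sorted(set(adapter.upper()) - IUPAC_NUCLEOTIDES)


def warn_if_non_iupac(adapter):
    """Issue a warning if the adapter contains non-IUPAC characters (hypothetical helper)."""
    bad = non_iupac_characters(adapter)
    if bad:
        warnings.warn(
            "Adapter %r contains non-IUPAC characters: %s" % (adapter, ", ".join(bad))
        )
    return bad
```

Issuing a warning instead of raising an error keeps existing command lines working while still pointing out likely typos.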
Specifying adapters
The idea is to deprecate the -b and -g parameters. Only -a would be used, with a special syntax for each adapter type. This makes it a bit easier to add new adapter types in the future.
Adapter type | Current syntax | Proposed syntax |
back | -a ADAPTER | -a ADAPTER or -a ...ADAPTER |
suffix | -a ADAPTER$ | -a ...ADAPTER$ |
front | -g ADAPTER | -a ADAPTER... |
prefix | -g ^ADAPTER | -a ^ADAPTER... |
anywhere | -b ADAPTER | -a ...ADAPTER... ??? |
paired | (not implemented) | -a ADAPTER...ADAPTER or -a ^ADAPTER...ADAPTER |
Or add only -a ADAPTER... as an alias for -g ^ADAPTER and -a ...ADAPTER as an alias for -a ADAPTER.
The ... would be equivalent to N* as in regular expressions.
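To get a feeling for how the single -a syntax from the table could be told apart, here is a minimal sketch. The function name `classify_adapter_spec` is hypothetical, and the sketch follows the table literally: plain `ADAPTER$` is treated as a suffix adapter, and anything the table does not list (for example `^ADAPTER` without an ellipsis) is rejected.

```python
def classify_adapter_spec(spec):
    """Map a proposed '-a' specification to (adapter_type, sequences).

    Hypothetical helper; the rules mirror the rows of the table only.
    """
    anchored_start = spec.startswith("^")
    anchored_end = spec.endswith("$")
    core = spec[1:] if anchored_start else spec
    core = core[:-1] if anchored_end else core
    parts = core.split("...")

    if len(parts) == 1 and parts[0] and not anchored_start:
        # ADAPTER or ADAPTER$
        return ("suffix" if anchored_end else "back", parts)
    if len(parts) == 2:
        left, right = parts
        if left and right and not anchored_end:
            # ADAPTER...ADAPTER or ^ADAPTER...ADAPTER
            return ("paired", parts)
        if right and not left and not anchored_start:
            # ...ADAPTER or ...ADAPTER$
            return ("suffix" if anchored_end else "back", [right])
        if left and not right and not anchored_end:
            # ADAPTER... or ^ADAPTER...
            return ("prefix" if anchored_start else "front", [left])
    if len(parts) == 3 and parts[1] and not (
        parts[0] or parts[2] or anchored_start or anchored_end
    ):
        # ...ADAPTER...
        return ("anywhere", [parts[1]])
    raise ValueError("specification not covered by the table: %r" % spec)
```

For example, `classify_adapter_spec("^ACGT...")` yields a prefix adapter and `classify_adapter_spec("...ACGT...")` an anywhere adapter.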
Another idea: Allow something such as -a ADAP$TER or -a ADAPTER$NNN. This would be a way to specify less strict anchoring.
Make it possible to specify that the rightmost or leftmost match should be picked. Default right now: Leftmost, even for -g adapters.
Allow N{3,10} as in regular expressions (for a variable-length sequence).
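The regular-expression analogy for `...` and `N{3,10}` can be made concrete with Python's `re` module. The sketch below only illustrates the intended semantics with exact matching; it is not the error-tolerant alignment used for actual adapter search. The function name `spec_to_regex` is hypothetical, and N is assumed to stand for any of A, C, G, T.

```python
import re

# Accepted tokens: plain bases, N with an optional {m} or {m,n} repeat,
# and the '...' ellipsis; '^' and '$' may anchor the whole specification.
_VALID_SPEC = re.compile(r"\^?(?:[ACGT]|N(?:\{\d+(?:,\d+)?\})?|\.\.\.)+\$?")


def spec_to_regex(spec):
    """Translate an adapter specification into a compiled regular expression."""
    if not _VALID_SPEC.fullmatch(spec):
        raise ValueError("unsupported adapter specification: %r" % spec)
    # '...' is treated as N*, then every N becomes a character class.
    # A repeat count such as {3,10} directly after N carries over unchanged.
    pattern = spec.replace("...", "N*").replace("N", "[ACGT]")
    return re.compile(pattern)
```

With this translation, `spec_to_regex("ACGTN{3,10}TTT")` matches reads in which ACGT and TTT are separated by three to ten arbitrary bases.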
Paired-end trimming
- Could also use a paired-end read merger, then remove adapters with -a and -g
- Should minimum overlap be sum of the two overlaps in each read?
Single-letter command-line options
Remaining characters:
- Uppercase letters: all except A, B, G, M, N, O
- Lowercase letters: i, j, k, l, s, w