Ideas/To Do
This is a rather unsorted list of features that would be nice to have, of things that could be improved in the source code, and of possible algorithmic improvements.
- show average error rate
- In colorspace and probably also for Illumina data, gapped alignment is not necessary
- --progress
- run pylint, pychecker
- length histogram
- check whether input is FASTQ although -f fasta is given
- search for adapters in the order in which they are given on the command line
- more tests for the alignment algorithm
- deprecate --rest-file
- --detect prints out best guess which of the given adapters is the correct one
- alignment algorithm: make a ‘banded’ version
- it seems the str.find optimization isn’t very helpful. In any case, it should be moved into the Aligner class.
- allow removing the sequence before or after the adapter instead of the adapter itself
- instead of trimming, convert adapter to lowercase
- warn when given adapter sequence contains non-IUPAC characters
- try multithreading again, this time use os.pipe() or 0mq
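As an illustration of the non-IUPAC warning item above, here is a minimal sketch. The names `IUPAC_CHARS`, `non_iupac_characters` and `check_adapter` are hypothetical and not part of the existing code; the sketch also assumes that lowercase letters and the RNA code U should be accepted.

```python
import warnings

# IUPAC nucleotide codes. Accepting U and lowercase input is an assumption.
IUPAC_CHARS = frozenset("ACGTURYSWKMBDHVN")


def non_iupac_characters(adapter):
    """Return the sorted characters of `adapter` that are not IUPAC codes."""
    return sorted(set(adapter.upper()) - IUPAC_CHARS)


def check_adapter(adapter):
    """Warn if `adapter` contains non-IUPAC characters; return True if it is clean."""
    bad = non_iupac_characters(adapter)
    if bad:
        warnings.warn(
            "adapter {!r} contains non-IUPAC characters: {}".format(
                adapter, ", ".join(bad)
            )
        )
    return not bad
```

A check like this would probably run once per adapter while the command line is parsed, so that the warning appears before any reads are processed.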
Specifying adapters
The idea is to deprecate the -b and -g parameters. Only -a is used
with a special syntax for each adapter type. This makes it a bit easier to add
new adapter types in the future.
| Adapter type | Current syntax | Proposed syntax |
|---|---|---|
| back | -a ADAPTER | -a ADAPTER or -a ...ADAPTER |
| suffix | -a ADAPTER$ | -a ...ADAPTER$ |
| front | -g ADAPTER | -a ADAPTER... |
| prefix | -g ^ADAPTER | -a ^ADAPTER... (or have anchoring by default?) |
| anywhere | -b ADAPTER | -a ...ADAPTER... ??? |
| paired | (not implemented) | -a ADAPTER...ADAPTER or -a ^ADAPTER...ADAPTER |
Or add only -a ADAPTER... as an alias for -g ^ADAPTER and
-a ...ADAPTER as an alias for -a ADAPTER.
The ... would be equivalent to N* as in regular expressions.
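The mapping from the proposed -a syntax to adapter types could be sketched as follows. The function name `classify_adapter_spec` is hypothetical, the "anywhere" form is still marked as undecided above, and the sketch does no validation of the sequence itself.

```python
def classify_adapter_spec(spec):
    """Map a proposed ``-a`` specification to an adapter type name.

    Follows the table above: leading/trailing ``...`` mark where read
    sequence may occur, ``^`` and ``$`` mark anchoring.
    """
    leading = spec.startswith("...")
    trailing = spec.endswith("...")
    core = spec[3:] if leading else spec
    core = core[:-3] if trailing else core
    if "..." in core:
        # ADAPTER...ADAPTER or ^ADAPTER...ADAPTER
        return "paired"
    if leading and trailing:
        return "anywhere"
    if trailing:
        return "prefix" if core.startswith("^") else "front"
    # No trailing dots: the adapter is at the 3' end.
    return "suffix" if core.endswith("$") else "back"
```

For example, `classify_adapter_spec("^ACGT...")` yields `"prefix"`, while a plain `"ACGT"` stays a `"back"` adapter, so existing -a usage would keep its meaning.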
Another idea: Allow something such as -a ADAP$TER or -a ADAPTER$NNN.
This would be a way to specify less strict anchoring.
Make it possible to specify whether the rightmost or the leftmost match should be picked. The current default is the leftmost match, even for -g adapters.
Allow N{3,10} as in regular expressions (for a variable-length sequence).
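Because N{3,10} and N* already follow regular-expression quantifier syntax, a specification can be turned into a Python regular expression with two substitutions. This is only a sketch for exact matching (no mismatches or indels, unlike the real alignment), and `spec_to_regex` is a hypothetical name.

```python
import re

# Characters permitted in a specification once "..." has been rewritten to "N*".
_ALLOWED = re.compile(r"[ACGTN0-9{},*^$]*")


def spec_to_regex(spec):
    """Compile an adapter specification into an exact-match regular expression."""
    pattern = spec.replace("...", "N*")
    if not _ALLOWED.fullmatch(pattern):
        raise ValueError("unsupported character in adapter specification: %r" % spec)
    # N stands for any base; quantifiers such as {3,10} and * carry over unchanged.
    return re.compile(pattern.replace("N", "[ACGT]"))
```

With this, `spec_to_regex("ACGTN{3,10}TTT")` finds a match in `GGACGTAAATTTCC` (three variable bases) but not in `GGACGTAATTTCC` (only two).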
Use parentheses to specify the part of the sequence that should be kept:
- -a (...)ADAPTER (default)
- -a (...ADAPTER) (default)
- -a ADAPTER(...) (default)
- -a (ADAPTER...) (??)
Or, specify the part that should be removed:
- -a ...(ADAPTER...)
- -a ...ADAPTER(...)
- -a (ADAPTER)...
Model somehow all the flags that exist for semiglobal alignment. For start of the adapter:
- Start of adapter can be degraded or not
- Bases are allowed to be before adapter or not
Not degraded and no bases before allowed = anchored. Degraded and bases before allowed = regular 5’ adapter.
By default, the 5’ end should be anchored, the 3’ end not.
- -a ADAPTER... → not degraded, no bases before allowed
- -a N*ADAPTER... → not degraded, bases before allowed
- -a ADAPTER^... → degraded, no bases before allowed
- -a N*ADAPTER^... → degraded, bases before allowed
- -a ...ADAPTER → degraded, bases after allowed
- -a ...ADAPTER$ → not degraded, no bases after allowed
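The six cases above can be captured with two boolean flags per adapter. The following sketch shows one possible way to parse them; the names `EndFlags` and `parse_end_flags` are hypothetical, and specifications that combine both ends (such as the paired form) are deliberately not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndFlags:
    sequence: str              # adapter sequence without syntax markers
    end: str                   # "5'" or "3'"
    degraded: bool             # adapter may be truncated at the outer end
    extra_bases_allowed: bool  # read bases may precede (5') or follow (3') it


def parse_end_flags(spec):
    """Parse the single-adapter forms listed above into an EndFlags record."""
    if spec.endswith("..."):
        # 5' adapter: anchored unless marked otherwise.
        body = spec[:-3]
        extra = body.startswith("N*")
        if extra:
            body = body[2:]
        degraded = body.endswith("^")
        if degraded:
            body = body[:-1]
        return EndFlags(body, "5'", degraded, extra)
    if spec.startswith("..."):
        # 3' adapter: not anchored unless it ends with "$".
        body = spec[3:]
        anchored = body.endswith("$")
        if anchored:
            body = body[:-1]
        return EndFlags(body, "3'", not anchored, not anchored)
    raise ValueError("specification must start or end with '...': %r" % spec)
```

Keeping the two flags separate, rather than a single "anchored" switch, leaves room for the intermediate cases such as N*ADAPTER... that the list describes.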
Paired-end trimming
- Could also use a paired-end read merger, then remove adapters with -a and -g
Available/used letters for command-line options
- Remaining characters: All uppercase letters except A, B, G, M, N, O, U
- Lowercase letters: i, j, k, l, s, w
- Planned/reserved: Q (paired-end quality trimming), j (multithreading)