Program options for seqkit:

SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 0.15.0

Author: Wei Shen

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962

Usage:
  seqkit [command]

Available Commands:
  amplicon        retrieve amplicon (or specific region around it) via primer(s)
  bam             monitoring and online histograms of BAM record features
  common          find common sequences of multiple files by id/name/sequence
  concat          concatenate sequences with same ID from multiple files
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  duplicate       duplicate sequences N times
  faidx           create FASTA index file and extract subsequence
  fish            look for short sequences in larger sequences using local alignment
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (with length/GC content/GC skew)
  genautocomplete generate shell autocompletion script
  grep            search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  head            print first N FASTA/Q records
  help            Help about any command
  locate          locate subsequences/motifs, mismatch allowed
  mutate          edit sequence (point mutation, insertion, deletion)
  pair            match up paired-end reads from two fastq files
  range           print FASTA/Q records in a range (start:end)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  rmdup           remove duplicated sequences by id/name/sequence
  sample          sample sequences by number or proportion
  sana            sanitize broken single line fastq files
  scat            real time recursive concatenation and streaming of fastx files
  seq             transform sequences (reverse, complement, extract ID...)
  shuffle         shuffle sequences
  sliding         sliding sequences, circular genome supported
  sort            sort sequences by id/name/sequence/length
  split           split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2          split sequences into files by size/parts (FASTA, PE/SE FASTQ)
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  tab2fx          convert tabular format to FASTA/Q format
  translate       translate DNA/RNA to protein sequence (supporting ambiguous bases)
  version         print version information and check for update
  watch           monitoring and online histograms of sequence features

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it is detected automatically from the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others. can also be set with environment variable SEQKIT_THREADS) (default 2)

Use "seqkit [command] --help" for more information about a command.
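
The default value of --id-regexp is shown with doubled backslashes because of string quoting; the pattern itself is ^(\S+)\s?, which captures everything up to the first whitespace. The Python sketch below only illustrates that behaviour and is not seqkit's own code. The helper name parse_id, the fallback to the whole header when nothing matches, and the NCBI-style pattern are assumptions made for illustration.

```python
import re

# Default ID pattern from the help text, written as a Python raw string.
DEFAULT_ID_REGEXP = r"^(\S+)\s?"

# Hypothetical pattern for NCBI-style headers; whether --id-ncbi uses
# exactly this expression is an assumption.
NCBI_STYLE_REGEXP = r"\|([^|]+)\| "


def parse_id(header, id_regexp=DEFAULT_ID_REGEXP):
    """Return the first capture group of id_regexp applied to a header.

    The header is given without the leading '>'. Falling back to the
    whole header when the pattern does not match is an assumption.
    """
    match = re.search(id_regexp, header)
    return match.group(1) if match else header


if __name__ == "__main__":
    print(parse_id("seq1 some description"))
    print(parse_id("gi|110645304|ref|NC_002516.2| Pseud...", NCBI_STYLE_REGEXP))
```

With the default pattern the ID is the first whitespace-delimited token of the header, which is why the description after the first space is not part of the ID.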
----------------------------------------------
Program options for seqkit_grep:

search sequences by ID/name/sequence/sequence motifs, mismatch allowed

Attentions:

  0. By default, patterns are matched against the sequence ID. Use
     "-n/--by-name" to match the full name instead of just the ID.
  1. Unlike POSIX/GNU grep, the pattern is compared to the whole target
     (ID/full header) by default. Switch "-r/--use-regexp" on for
     partial matching.
  2. When searching by sequence, matching is partial, and both positive
     and negative strands are searched.
     Mismatches are allowed using flag "-m/--max-mismatch", but this is not
     fast enough for large genomes like the human genome.
     It is fast enough for microbial genomes, though.
  3. Degenerate bases/residues like "RYMM.." are also supported by flag -d.
     Do not use degenerate bases/residues in regular expressions; convert
     them to regular expression syntax first, e.g., change "N" or "X" to ".".
  4. When providing search patterns (motifs) via flag '-p',
     use double quotation marks for patterns containing a comma,
     e.g., -p '"A{2,}"' or -p "\"A{2,}\"", because the command line argument
     parser accepts comma-separated values (CSV) for multiple values (motifs).
     Patterns in a file do not follow this rule.
  5. The order of sequences in the result is consistent with that in the
     original file, not the order of the query patterns.
     But for FASTA files, you can use:
        seqkit faidx seqs.fasta --infile-list IDs.txt

You can specify the sequence region for searching with flag -R (--region).
The definition of region is 1-based and with some custom design.
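
The region convention can be read as: positions are 1-based and inclusive, negative positions count back from the end of the sequence (-1 is the last base), and a region that reaches past either end is cut to the sequence. The Python sketch below is one way to express that reading, checked against the examples listed in this help text. It is not seqkit's implementation; the function names are hypothetical, and rejecting position 0 is an assumption, since the help text does not say how 0 is handled.

```python
from typing import Tuple


def region_to_slice(start: int, end: int, length: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive region to a 0-based half-open slice."""
    if start == 0 or end == 0:
        # Assumption: 0 is treated as invalid in this sketch.
        raise ValueError("region positions are 1-based; 0 is not valid")
    # Negative positions count back from the end: -1 is the last base.
    s = start if start > 0 else length + start + 1
    e = end if end > 0 else length + end + 1
    # Cut the region to the sequence (as in 1:12 on a 10-base sequence).
    s = max(s, 1)
    e = min(e, length)
    if s > e:
        return (0, 0)  # empty region
    return (s - 1, e)


def subseq(seq: str, region: str) -> str:
    """Extract a region written as 'start:end' from seq."""
    start, end = (int(part) for part in region.split(":"))
    a, b = region_to_slice(start, end, len(seq))
    return seq[a:b]


if __name__ == "__main__":
    seq = "ACGTNacgtn"
    for region in ["1:1", "2:4", "-4:-2", "-4:-1", "-1:-1", "2:-2", "1:12", "-12:-1"]:
        print(region, subseq(seq, region))
```

Converting to a half-open slice keeps the arithmetic in one place: the 1-based inclusive start becomes start - 1, and the inclusive end is used directly as the exclusive bound.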

Examples:

 1-based index    1 2 3 4 5 6 7 8 9 10
negative index    0-9-8-7-6-5-4-3-2-1
           seq    A C G T N a c g t n
           1:1    A
           2:4      C G T
         -4:-2                c g t
         -4:-1                c g t n
         -1:-1                      n
          2:-2      C G T N a c g t
          1:-1    A C G T N a c g t n
          1:12    A C G T N a c g t n
        -12:-1    A C G T N a c g t n

Usage:
  seqkit grep [flags]

Flags:
  -n, --by-name                match by full name instead of just ID
  -s, --by-seq                 search subseq on seq, both positive and negative strand are searched, and mismatch allowed using flag -m/--max-mismatch
  -c, --circular               circular genome
  -d, --degenerate             pattern/motif contains degenerate base
      --delete-matched         delete a pattern right after being matched, this keeps the firstly matched data and speeds up matching when using regular expressions
  -h, --help                   help for grep
  -i, --ignore-case            ignore case
  -v, --invert-match           invert the sense of matching, to select non-matching records
  -m, --max-mismatch int       max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster
  -P, --only-positive-strand   only search on positive strand
  -p, --pattern strings        search pattern (multiple values supported. Attention: use double quotation marks for patterns containing comma, e.g., -p '"A{2,}"')
  -f, --pattern-file string    pattern file (one record per line)
  -R, --region string          specify sequence region for searching. e.g., 1:12 for first 12 bases, -12:-1 for last 12 bases
  -r, --use-regexp             patterns are regular expression

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it is detected automatically from the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others. can also be set with environment variable SEQKIT_THREADS) (default 2)

----------------------------------------------