'\" t
.TH samtools 1 "18 July 2018" "samtools-1.9" "Bioinformatics tools"
.SH NAME
samtools \- Utilities for the Sequence Alignment/Map (SAM) format
.\"
.\" Copyright (C) 2008-2011, 2013-2018 Genome Research Ltd.
.\" Portions copyright (C) 2010, 2011 Broad Institute.
.\"
.\" Author: Heng Li <lh3@sanger.ac.uk>
.\" Author: Joshua C. Randall <jcrandall@alum.mit.edu>
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
.\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.
.\" For code blocks and examples (cf groff's Ultrix-specific man macros)
.de EX

.  in +\\$1
.  nf
.  ft CR
..
.de EE
.  ft
.  fi
.  in

..
.
.SH SYNOPSIS
.PP
samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
.PP
samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam
.PP
samtools index aln.sorted.bam
.PP
samtools idxstats aln.sorted.bam
.PP
samtools flagstat aln.sorted.bam
.PP
samtools stats aln.sorted.bam
.PP
samtools bedcov region.bed aln.sorted.bam
.PP
samtools depth aln.sorted.bam
.PP
samtools view aln.sorted.bam chr2:20,100,000-20,200,000
.PP
samtools merge out.bam in1.bam in2.bam in3.bam
.PP
samtools faidx ref.fasta
.PP
samtools fqidx ref.fastq
.PP
samtools tview aln.sorted.bam ref.fasta
.PP
samtools split merged.bam
.PP
samtools quickcheck in1.bam in2.cram
.PP
samtools dict -a GRCh38 -s "Homo sapiens" ref.fasta
.PP
samtools fixmate in.namesorted.sam out.bam
.PP
samtools mpileup -C50 -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
.PP
samtools flags PAIRED,UNMAP,MUNMAP
.PP
samtools fastq input.bam > output.fastq
.PP
samtools fasta input.bam > output.fasta
.PP
samtools addreplacerg -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha' -o output.bam input.bam
.PP
samtools collate -o aln.name_collated.bam aln.sorted.bam
.PP
samtools depad input.bam
.PP
samtools markdup in.algnsorted.bam out.bam

.SH DESCRIPTION
.PP
Samtools is a set of utilities that manipulate alignments in the BAM
format. It imports from and exports to the SAM (Sequence Alignment/Map)
format, does sorting, merging and indexing, and allows reads in any
region to be retrieved swiftly.

Samtools is designed to work on a stream. It regards an input file `-'
as the standard input (stdin) and an output file `-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always outputs warning and error messages to the standard
error output (stderr).

Samtools is also able to open a BAM (not SAM) file on a remote FTP or
HTTP server if the BAM file name starts with `ftp://' or `http://'.
Samtools checks the current working directory for the index file and
downloads the index if it is absent. Samtools does not retrieve the
entire alignment file unless it is asked to do so.

.SH COMMANDS AND OPTIONS

.TP 10 \"-------- view
.B view
samtools view
.RI [ options ]
.IR in.sam | in.bam | in.cram
.RI [ region ...]

With no options or regions specified, prints all alignments in the specified
input alignment file (in SAM, BAM, or CRAM format) to standard output
in SAM format (with no header).

You may specify one or more space-separated region specifications after the
input filename to restrict output to only those alignments which overlap the
specified region(s). Use of region specifications requires a coordinate-sorted
and indexed input file (in BAM or CRAM format).

The
.BR -b ,
.BR -C ,
.BR -1 ,
.BR -u ,
.BR -h ,
.BR -H ,
and
.B -c
options change the output format from the default of headerless SAM, and the
.B -o
and
.B -U
options set the output file name(s).

The
.B -t
and
.B -T
options provide additional reference data. One of these two options is required
when SAM input does not contain @SQ headers, and the
.B -T
option is required whenever writing CRAM output.

The
.BR -L ,
.BR -M ,
.BR -r ,
.BR -R ,
.BR -s ,
.BR -q ,
.BR -l ,
.BR -m ,
.BR -f ,
.BR -F ,
and
.B -G
options filter the alignments that will be included in the output to only those
alignments that match certain criteria.

The
.B -x
and
.B -B
options modify the data which is contained in each alignment.

Finally, the
.B -@
option can be used to allocate additional threads to be used for compression, and the
.B -?
option requests a long help message.

.TP
.B REGIONS:
.RS
Regions can be specified as: RNAME[:STARTPOS[-ENDPOS]] and all position
coordinates are 1-based.

Important note: when multiple regions are given, some alignments may be output
multiple times if they overlap more than one of the specified regions.

Examples of region specifications:
.TP 10
.B chr1
Output all alignments mapped to the reference sequence named `chr1' (i.e. @SQ SN:chr1).
.TP
.B chr2:1000000
The region on chr2 beginning at base position 1,000,000 and ending at the
end of the chromosome.
.TP
.B chr3:1000-2000
The 1001bp region on chr3 beginning at base position 1,000 and ending at base
position 2,000 (including both end positions).
.TP
.B '*'
Output the unmapped reads at the end of the file.
(This does not include any unmapped reads placed on a reference sequence
alongside their mapped mates.)
.TP
.B .
Output all alignments.
(Mostly unnecessary as not specifying a region at all has the same effect.)
.RE

.B OPTIONS:
.RS
.TP 10
.B -b
Output in the BAM format.
.TP
.B -C
Output in the CRAM format (requires -T).
.TP
.B -1
Enable fast BAM compression (implies -b).
.TP
.B -u
Output uncompressed BAM. This option saves time spent on
compression/decompression and is thus preferred when the output is piped
to another samtools command.
.TP
.B -h
Include the header in the output.
.TP
.B -H
Output the header only.
.TP
.B -c
Instead of printing the alignments, only count them and print the
total number. All filter options, such as
.BR -f ,
.BR -F ,
and
.BR -q ,
are taken into account.
.TP
.B -?
Output long help and exit immediately.
.TP
.BI "-o " FILE
Output to
.I FILE
[stdout].
.TP
.BI "-U " FILE
Write alignments that are
.I not
selected by the various filter options to
.IR FILE .
When this option is used, all alignments (or all alignments intersecting the
.I regions
specified) are written to either the output file or this file, but never both.
.TP
.BI "-t " FILE
A tab-delimited
.IR FILE .
Each line must contain the reference name in the first column and the length of
the reference in the second column, with one line for each distinct reference.
Any additional fields beyond the second column are ignored. This file also
defines the order of the reference sequences in sorting. If you run:
`samtools faidx <ref.fa>', the resulting index file
.I <ref.fa>.fai
can be used as this
.IR FILE .
.TP
.BI "-T " FILE
A FASTA format reference
.IR FILE ,
optionally compressed by
.B bgzip
and ideally indexed by
.B samtools
.BR faidx .
If an index is not present, one will be generated for you.
.TP
.BI "-L " FILE
Only output alignments overlapping the input BED
.I FILE
[null].
.TP
.B "-M "
Use the multi-region iterator on the union of the BED file and
command-line region arguments.  This avoids re-reading the same regions
of files and so can sometimes be much faster.  Note that this also removes
duplicate output: without it, an alignment that overlaps multiple
regions specified on the command line will be reported multiple times.
.TP
.BI "-r " STR
Output alignments in read group
.I STR
[null].
Note that records with no
.B RG
tag will also be output when using this option.
This behaviour may change in a future release.
.TP
.BI "-R " FILE
Output alignments in read groups listed in
.I FILE
[null].
Note that records with no
.B RG
tag will also be output when using this option.
This behaviour may change in a future release.
.TP
.BI "-q " INT
Skip alignments with MAPQ smaller than
.I INT
[0].
.TP
.BI "-l " STR
Only output alignments in library
.I STR
[null].
.TP
.BI "-m " INT
Only output alignments with number of CIGAR bases consuming query
sequence \(>=
.I INT
[0]
.TP
.BI "-f " INT
Only output alignments with all bits set in
.I INT
present in the FLAG field.
.I INT
can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
.TP
.BI "-F " INT
Do not output alignments with any bits set in
.I INT
present in the FLAG field.
.I INT
can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
.TP
.BI "-G " INT
Do not output alignments with all bits set in
.I INT
present in the FLAG field.  This is the opposite of \fI-f\fR such
that \fI-f12 -G12\fR is the same as no filtering at all.
.I INT
can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
.TP
.BI "-x " STR
Read tag to exclude from output (repeatable) [null]
.TP
.B -B
Collapse the backward CIGAR operation.
.TP
.BI "-s " FLOAT
Output only a proportion of the input alignments.
This subsampling acts in the same way on all of the alignment records in
the same template or read pair, so it never keeps a read but not its mate.
.IP
The integer and fractional parts of the
.BI "-s " INT . FRAC
option are used separately: the part after the
decimal point sets the fraction of templates/pairs to be kept,
while the integer part is used as a seed that influences
.I which
subset of reads is kept.
.IP
.\" Reads are retained based on a score computed by hashing their QNAME
.\" field and the seed value.
When subsampling data that has previously been subsampled, be sure to use
a different seed value from those used previously; otherwise more reads
will be retained than expected.
.TP
.BI "-@ " INT
Number of BAM compression threads to use in addition to main thread [0].
.TP
.B -S
Ignored for compatibility with previous samtools versions.
Previously this option was required if input was in SAM format, but now the
correct format is automatically detected by examining the first few characters
of input.
.RE
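.IP
The following invocations sketch some common uses of the options above.
The file names are placeholders, and the threshold and seed values are
arbitrary choices made for illustration only:
.EX 2
# Convert SAM to BAM, keeping only mapped reads with MAPQ of at least 20
samtools view -b -F 0x4 -q 20 -o aln.filtered.bam aln.sam
# Count the unmapped reads in a file
samtools view -c -f 0x4 aln.bam
# Print the header plus alignments overlapping a region (needs an index)
samtools view -h aln.sorted.bam chr3:1000-2000
# Keep about 10% of the templates, using 42 as the seed
samtools view -b -s 42.1 -o aln.subsampled.bam aln.bam
.EE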

.TP \"-------- sort
.B sort
.na
samtools sort
.RB [ -l
.IR level ]
.RB [ -m
.IR maxMem ]
.RB [ -o
.IR out.bam ]
.RB [ -O
.IR format ]
.RB [ -n ]
.RB [ -t
.IR tag ]
.RB [ -T
.IR tmpprefix ]
.RB [ -@
.IR threads "] [" in.sam | in.bam | in.cram ]
.ad

Sort alignments by leftmost coordinates, or by read name when
.B -n
is used.
An appropriate
.B @HD-SO
sort order header tag will be added or an existing one updated if necessary.

The sorted output is written to standard output by default, or to the
specified file
.RI ( out.bam )
when
.B -o
is used.
This command will also create temporary files
.IB tmpprefix . %d .bam
as needed when the entire alignment data cannot fit into memory
(as controlled via the
.B -m
option).

.B Options:
.RS
.TP 11
.BI "-l " INT
Set the desired compression level for the final output file, ranging from 0
(uncompressed) or 1 (fastest but minimal compression) to 9 (best compression
but slowest to write), similarly to
.BR gzip (1)'s
compression level setting.
.IP
If
.B -l
is not used, the default compression level will apply.
.TP
.BI "-m " INT
Approximately the maximum required memory per thread, specified either in bytes
or with a
.BR K ", " M ", or " G
suffix.
[768 MiB]
.IP
To prevent sort from creating a huge number of temporary files, it enforces a
minimum value of 1M for this setting.
.TP
.B -n
Sort by read names (i.e., the
.B QNAME
field) rather than by chromosomal coordinates.
.TP
.BI "-t " TAG
Sort first by the value in the alignment tag TAG, then by position or name (if
also using \fB-n\fP).
.TP
.BI "-o " FILE
Write the final sorted output to
.IR FILE ,
rather than to standard output.
.TP
.BI "-O " FORMAT
Write the final output as
.BR sam ", " bam ", or " cram .

By default, samtools tries to select a format based on the
.B -o
filename extension; if output is to standard output or no format can be
deduced,
.B bam
is selected.
.TP
.BI "-T " PREFIX
Write temporary files to
.IB PREFIX . nnnn .bam,
or if the specified
.I PREFIX
is an existing directory, to
.IB PREFIX /samtools. mmm . mmm .tmp. nnnn .bam,
where
.I mmm
is unique to this invocation of the
.B sort
command.
.IP
By default, any temporary files are written alongside the output file, as
.IB out.bam .tmp. nnnn .bam,
or if output is to standard output, in the current directory as
.BI samtools. mmm . mmm .tmp. nnnn .bam.
.TP
.BI "-@ " INT
Set number of sorting and compression threads.
By default, operation is single-threaded.
.PP
.B Ordering Rules

The following rules are used for ordering records.

If option \fB-t\fP is in use, records are first sorted by the value of
the given alignment tag, and then by position or name (if using \fB-n\fP).
For example, \*(lq-t RG\*(rq will make read group the primary sort key.  The
rules for ordering by tag are:

.IP \(bu 4
Records that do not have the tag are sorted before ones that do.
.IP \(bu 4
If the types of the tags are different, they will be sorted so
that single character tags (type A) come before array tags (type B), then
string tags (types H and Z), then numeric tags (types f and i).
.IP \(bu 4
Numeric tags (types f and i) are compared by value.  Note that comparisons
of floating-point values are subject to issues of rounding and precision.
.IP \(bu 4
String tags (types H and Z) are compared based on the binary
contents of the tag using the C
.BR strcmp (3)
function.
.IP \(bu 4
Character tags (type A) are compared by binary character value.
.IP \(bu 4
No attempt is made to compare tags of other types \(em notably type B
array values will not be compared.
.PP
When the \fB-n\fP option is present, records are sorted by name.  Names are
compared so as to give a \*(lqnatural\*(rq ordering \(em i.e. sections
consisting of digits are compared numerically while all other sections are
compared based on their binary representation.  This means \*(lqa1\*(rq will
come before \*(lqb1\*(rq and \*(lqa9\*(rq will come before \*(lqa10\*(rq.
Records with the same name will be ordered according to the values of
the READ1 and READ2 flags (see
.BR flags ).

When the \fB-n\fP option is
.B not
present, reads are sorted by reference (according to the order of the @SQ
header records), then by position in the reference, and then by the REVERSE
flag.

.B Note

.PP
Historically
.B samtools sort
also accepted a less flexible way of specifying the final and
temporary output filenames:
.IP
samtools sort
.RB [ -f "] [" -o ]
.I in.bam out.prefix
.PP
This has now been removed.
The previous \fIout.prefix\fP argument (and \fB-f\fP option, if any)
should be changed to an appropriate combination of \fB-T\fP \fIPREFIX\fP
and \fB-o\fP \fIFILE\fP.  The previous \fB-o\fP option should be removed,
as output defaults to standard output.
.RE
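.IP
As an illustration of the options and ordering rules above, the commands
below sort the same input in three ways.
The file names, thread count and memory limit are placeholders rather than
recommended settings:
.EX 2
# Coordinate sort with 4 threads and roughly 1G of memory per thread
samtools sort -@ 4 -m 1G -T /tmp/aln.sorted -o aln.sorted.bam aln.bam
# Sort by read name
samtools sort -n -o aln.namesorted.bam aln.bam
# Sort by read group first, then by position
samtools sort -t RG -o aln.rgsorted.bam aln.bam
.EE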

.TP \"-------- index
.B index
samtools index
.RB [ -bc ]
.RB [ -m
.IR INT ]
.IR aln.bam | aln.cram
.RI [ out.index ]

Index a coordinate-sorted BAM or CRAM file for fast random access.
(Note that this does not work with SAM files even if they are bgzip
compressed \(em to index such files, use tabix(1) instead.)

This index is needed when
.I region
arguments are used to limit
.B samtools view
and similar commands to particular regions of interest.

If an output filename is given, the index file will be written to
.IR out.index .
Otherwise, for a CRAM file
.IR aln.cram ,
index file
.IB aln.cram .crai
will be created; for a BAM file
.IR aln.bam ,
either
.IB aln.bam .bai
or
.IB aln.bam .csi
will be created, depending on the index format selected.

.B Options:
.RS
.TP 8
.B -b
Create a BAI index.
This is currently the default when no format options are used.
.TP
.B -c
Create a CSI index.
By default, the minimum interval size for the index is 2^14, which is the same
as the fixed value used by the BAI format.
.TP
.BI "-m " INT
Create a CSI index, with a minimum interval size of 2^INT.
.RE
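.IP
For example, assuming a coordinate-sorted file named aln.sorted.bam
(a placeholder), the first command below should create aln.sorted.bam.bai
and the second aln.sorted.bam.csi; the last one then relies on an index
to retrieve a single region:
.EX 2
samtools index aln.sorted.bam
samtools index -c aln.sorted.bam
samtools view aln.sorted.bam chr2:20,100,000-20,200,000
.EE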

.TP \"-------- idxstats
.B idxstats
samtools idxstats
.IR in.sam | in.bam | in.cram

Retrieve and print stats in the index file corresponding to the input file.
Before calling idxstats, the input BAM file should be indexed by samtools index.

If run on a SAM or CRAM file or an unindexed BAM file, this command
will still produce the same summary statistics, but does so by reading
through the entire file.  This is far slower than using the BAM
indices.

The output is TAB-delimited with each line consisting of reference sequence
name, sequence length, # mapped reads and # unmapped reads. It is written to
stdout.
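.IP
Because the output is plain TAB-delimited text, it can be summarised with
standard tools.
A minimal sketch, assuming an indexed file named aln.sorted.bam (a
placeholder) and an awk implementation on the PATH, that totals the third
and fourth columns:
.EX 2
samtools idxstats aln.sorted.bam |
  awk '{ mapped += $3; unmapped += $4 } END { print mapped, unmapped }'
.EE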

.TP \"-------- flagstat
.B flagstat
samtools flagstat
.IR in.sam | in.bam | in.cram

Does a full pass through the input file to calculate and print statistics
to stdout.

Provides counts for each of 13 categories based primarily on bit flags in
the FLAG field. Each category in the output is broken down into QC pass and
QC fail, which is presented as "#PASS + #FAIL" followed by a description of
the category.

The first row of output gives the total number of reads that are QC pass and
fail (according to flag bit 0x200). For example:

  122 + 28 in total (QC-passed reads + QC-failed reads)

This indicates that there are a total of 150 reads in the input file,
122 of which are marked as QC pass and 28 of which are marked as "not passing
quality controls".

Following this, additional categories are given for reads which are:

.RS 18
.TP
secondary
0x100 bit set
.TP
supplementary
0x800 bit set
.TP
duplicates
0x400 bit set
.TP
mapped
0x4 bit not set
.TP
paired in sequencing
0x1 bit set
.TP
read1
both 0x1 and 0x40 bits set
.TP
read2
both 0x1 and 0x80 bits set
.TP
properly paired
both 0x1 and 0x2 bits set and 0x4 bit not set
.TP
with itself and mate mapped
0x1 bit set and neither 0x4 nor 0x8 bits set
.TP
singletons
both 0x1 and 0x8 bits set and bit 0x4 not set
.RE

.RS 10
And finally, two rows are given that additionally filter on the reference
name (RNAME), mate reference name (MRNM), and mapping quality (MAPQ) fields:
.RE

.RS 18
.TP
with mate mapped to a different chr
0x1 bit set and neither 0x4 nor 0x8 bits set and MRNM not equal to RNAME
.TP
with mate mapped to a different chr (mapQ>=5)
0x1 bit set and neither 0x4 nor 0x8 bits set
and MRNM not equal to RNAME and MAPQ >= 5
.RE
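.IP
The per-category counts can be cross-checked with
.BR "samtools view -c" .
As a sketch, following the bit definitions listed above, the second
command below would be expected to print the sum of the QC-passed and
QC-failed duplicate counts reported by the first
(aln.bam is a placeholder name):
.EX 2
samtools flagstat aln.bam
samtools view -c -f 0x400 aln.bam
.EE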

.TP \"-------- stats
.B stats
samtools stats
.RI [ options ]
.IR in.sam | in.bam | in.cram
.RI [ region ...]

samtools stats collects statistics from BAM files and outputs them in a text format.
The output can be visualized graphically using plot-bamstats.

.B Options:
.RS
.TP 8
.BI "-c, --coverage " MIN , MAX , STEP
Set coverage distribution to the specified range (MIN, MAX, STEP all given as integers)
[1,1000,1]
.TP
.B -d, --remove-dups
Exclude from statistics reads marked as duplicates
.TP
.BI "-f, --required-flag "  STR "|" INT
Required flag, 0 for unset. See also `samtools flags`
[0]
.TP
.BI "-F, --filtering-flag " STR "|" INT
Filtering flag, 0 for unset. See also `samtools flags`
[0]
.TP
.BI "--GC-depth " FLOAT
The size of GC-depth bins (decreasing bin size increases memory requirement)
[2e4]
.TP
.B -h, --help
This help message
.TP
.BI "-i, --insert-size " INT
Maximum insert size
[8000]
.TP
.BI "-I, --id " STR
Include only listed read group or sample name
[]
.TP
.BI "-l, --read-length " INT
Include in the statistics only reads with the given read length
[]
.TP
.BI "-m, --most-inserts " FLOAT
Report only the main part of inserts
[0.99]
.TP
.BI "-P, --split-prefix " STR
A path or string prefix to prepend to filenames output when creating
categorised statistics files with
.BR -S / --split .
[input filename]
.TP
.BI "-q, --trim-quality " INT
The BWA trimming parameter
[0]
.TP
.BI "-r, --ref-seq " FILE
Reference sequence (required for GC-depth and mismatches-per-cycle calculation).
[]
.TP
.BI "-S, --split " TAG
In addition to the complete statistics, also output categorised statistics
based on the tagged field
.I TAG
(e.g., use
.B --split RG
to split into read groups).

Categorised statistics are written to files named
.RI < prefix >_< value >.bamstat,
where
.I prefix
is as given by
.B --split-prefix
(or the input filename by default) and
.I value
has been encountered as the specified tagged field's value in one or more
alignment records.
.TP
.BI "-t, --target-regions " FILE
Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive.
[]
.TP
.B "-x, --sparse"
Suppress outputting IS rows where there are no insertions.
.RE
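.IP
A short sketch of typical usage; ref.fasta, aln.sorted.bam, aln_split and
the output names are placeholders:
.EX 2
# Collect statistics, including those that need the reference
samtools stats -r ref.fasta aln.sorted.bam > aln.stats
# Ignore reads marked as duplicates and also write per-read-group files
samtools stats -d -S RG -P aln_split aln.sorted.bam > aln.nodup.stats
.EE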

.TP \"-------- bedcov
.B bedcov
samtools bedcov
.RI [ options ]
.IR region.bed " " in1.sam | in1.bam | in1.cram "[...]"

Reports the total read base count (i.e. the sum of per base read depths)
for each genomic region specified in the supplied BED file. The regions
are output as they appear in the BED file and are 0-based.
Counts for each alignment file supplied are reported in separate columns.

.B Options:
.RS
.TP
.BI "-Q " INT
.RI "Only count reads with mapping quality greater than " INT
.TP
.B  -j
Do not include deletions (D) and ref skips (N) in bedcov computation.
.RE
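.IP
For instance, assuming a BED file regions.bed and two coordinate-sorted,
indexed alignment files (all placeholder names; the threshold is
arbitrary), the following reports the regions with one count column per
input file:
.EX 2
samtools bedcov -Q 20 regions.bed in1.bam in2.bam
.EE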

.TP \"-------- depth
.B depth
samtools depth
.RI [ options ]
.RI "[" in1.sam | in1.bam | in1.cram " [" in2.sam | in2.bam | in2.cram "] [...]]"

Computes the depth at each position or region.

.B Options:
.RS
.TP 8
.B -a
Output all positions (including those with zero depth)
.TP
.B -a -a, -aa
Output absolutely all positions, including unused reference sequences.
Note that when used in conjunction with a BED file the -a option may
sometimes operate as if -aa was specified if the reference sequence
has coverage outside of the region specified in the BED file.
.TP
.BI "-b "  FILE
.RI "Compute depth at the list of positions or regions in the specified BED " FILE .
[]
.TP
.BI "-f " FILE
.RI "Use the BAM files specified in the " FILE
(a file of filenames, one file per line)
[]
.TP
.BI "-l " INT
.RI "Ignore reads shorter than " INT
.TP
.BI "-m, -d " INT
.RI "Truncate reported depth at a maximum of " INT " reads."
[8000]. If 0, depth is set to the maximum integer value, effectively removing any depth limit.
.TP
.BI "-q " INT
.RI "Only count reads with base quality greater than " INT
.TP
.BI "-Q " INT
.RI "Only count reads with mapping quality greater than " INT
.TP
.BI "-r " CHR ":" FROM "-" TO
Only report depth in specified region.
.RE
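.IP
A brief sketch using placeholder file names; the quality threshold is
arbitrary:
.EX 2
# Depth at every position of the BED regions, including zero-depth ones
samtools depth -a -b regions.bed -Q 20 in1.bam in2.bam
# Depth within a single region, with no cap on the reported depth
samtools depth -d 0 -r chr3:1000-2000 in1.bam
.EE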

.TP \"-------- merge
.B merge
samtools merge [-nur1fcp] [-h inh.sam] [-t tag] [-R reg] [-b <list>] <out.bam> <in1.bam> [<in2.bam> <in3.bam> ... <inN.bam>]

Merge multiple sorted alignment files, producing a single sorted output file
that contains all the input records and maintains the existing sort order.

If
.BR -h
is specified, the @SQ headers of the input files will be merged into the
specified header; otherwise they will be merged into a composite header
created from the input headers.  If, in the process of merging @SQ lines
for coordinate-sorted input files, a conflict arises as to the order (for
example input1.bam has @SQ for a,b,c and input2.bam has b,a,c), then the
resulting output file will need to be re-sorted back into coordinate order.

Unless the
.BR -c
or
.BR -p
flags are specified, when @RG and @PG records are merged into the output
header, any IDs found to be duplicates of existing IDs in the output header
will have a suffix appended to differentiate them from similar header
records from other files, and the read records will be updated to reflect
this.

The ordering of the records in the input files must match the usage of the
\fB-n\fP and \fB-t\fP command-line options.  If they do not, the output
order will be undefined.  See
.B sort
for information about record ordering.

.B OPTIONS:
.RS
.TP 8
.B -1
Use zlib compression level 1 to compress the output.
.TP
.BI -b \ FILE
List of input BAM files, one file per line.
.TP
.B -f
Overwrite the output file if it already exists.
.TP 8
.BI -h \ FILE
Use the lines of
.I FILE
as `@' headers to be copied to
.IR out.bam ,
replacing any header lines that would otherwise be copied from
.IR in1.bam .
.RI ( FILE
is actually in SAM format, though any alignment records it may contain
are ignored.)
.TP
.B -n
The input alignments are sorted by read names rather than by chromosomal
coordinates.
.TP
.BI -t \ TAG
The input alignments have been sorted by the value of TAG, then by either
position or name (if \fB-n\fP is given).
.TP
.BI -R \ STR
Merge files in the specified region indicated by
.I STR
[null]
.TP
.B -r
Attach an RG tag to each alignment. The tag value is inferred from file names.
.TP
.B -u
Uncompressed BAM output
.TP
.B -c
When several input files contain @RG headers with the same ID, emit only one
of them (namely, the header line from the first file we find that ID in) to
the merged output file.
Combining these similar headers is usually the right thing to do when the
files being merged originated from the same file.

Without \fB-c\fP, all @RG headers appear in the output file, with random
suffixes added to their IDs where necessary to differentiate them.
.TP
.B -p
Similarly, for each @PG ID in the set of files to merge, use the @PG line
of the first file we find that ID in rather than adding a suffix to
differentiate similar IDs.
.RE
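.IP
Some illustrative invocations, with placeholder file names:
.EX 2
# Merge three coordinate-sorted files
samtools merge out.bam in1.bam in2.bam in3.bam
# Merge name-sorted files, overwriting out.bam if it exists
samtools merge -f -n out.bam in1.namesorted.bam in2.namesorted.bam
# Merge pieces of one original file, combining identical @RG and @PG IDs
samtools merge -c -p out.bam part1.bam part2.bam
.EE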

.TP \"-------- faidx
.B faidx
samtools faidx <ref.fasta> [region1 [...]]

Index reference sequence in the FASTA format or extract subsequence from
indexed reference sequence. If no region is specified,
.B faidx
will index the file and create
.I <ref.fasta>.fai
on the disk. If regions are specified, the subsequences will be
retrieved and printed to stdout in the FASTA format.

The input file can be compressed in the
.B BGZF
format.

The sequences in the input file should all have different names.
If they do not, indexing will emit a warning about duplicate sequences and
retrieval will only produce subsequences from the first sequence with the
duplicated name.

FASTQ files can be read and indexed by this command.  Without using
.B --fastq
any extracted subsequence will be in FASTA format.

.B Options
.RS
.TP 8
.BI "-o, --output " FILE
Write FASTA to file rather than to stdout.
.TP
.BI "-n, --length " INT
Length of FASTA sequence line.
[60]
.TP
.B -c, --continue
Continue working if a non-existent region is requested.
.TP
.BI "-r, --region-file " FILE
Read regions from a file. Format is chr:from-to, one per line.
.TP
.B -f, --fastq
Read FASTQ files and output extracted sequences in FASTQ format.  Same as using samtools fqidx.
.TP
.B -i, --reverse-complement
Output the sequence as the reverse complement.
When this option is used, \*(lq/rc\*(rq will be appended to the sequence names.
To turn this off or change the string appended, use the
.B --mark-strand
option.
.TP
.B     --mark-strand TYPE
Append strand indicator to sequence name.  TYPE can be one of:
.RS
.TP
.B rc
Append '/rc' when writing the reverse complement.  This is the default.
.TP
.B no
Do not append anything.
.TP
.B sign
Append '(+)' for forward strand or '(-)' for reverse complement.  This matches
the output of \*(lqbedtools getfasta -s\*(rq.
.TP
.B custom,<pos>,<neg>
Append string <pos> to names when writing the forward strand and <neg> when
writing the reverse strand.  Spaces are preserved, so it is possible to move
the indicator into the comment part of the description line by including
a leading space in the strings <pos> and <neg>.
.RE
.TP
.B -h, --help
Print help message and exit.
.RE
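.IP
For example (ref.fasta, region.fa and the region itself are placeholders):
.EX 2
# Create ref.fasta.fai
samtools faidx ref.fasta
# Extract a region to a file, with 80 bases per line
samtools faidx -n 80 -o region.fa ref.fasta chr3:1000-2000
# Reverse complement, marking the strand as (+) or (-)
samtools faidx -i --mark-strand sign ref.fasta chr3:1000-2000
.EE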

.TP \"-------- fqidx
.B fqidx
samtools fqidx <ref.fastq> [region1 [...]]

Index reference sequence in the FASTQ format or extract subsequence from
indexed reference sequence. If no region is specified,
.B fqidx
will index the file and create
.I <ref.fastq>.fai
on the disk. If regions are specified, the subsequences will be
retrieved and printed to stdout in the FASTQ format.

The input file can be compressed in the
.B BGZF
format.

The sequences in the input file should all have different names.
If they do not, indexing will emit a warning about duplicate sequences and
retrieval will only produce subsequences from the first sequence with the
duplicated name.

.B samtools fqidx
should only be used on fastq files with a small number of entries.
Trying to use it on a file containing millions of short sequencing reads
will produce an index that is almost as big as the original file, and
searches using the index will be very slow and use a lot of memory.

.B Options
.RS
.TP 8
.BI "-o, --output " FILE
Write FASTQ to file rather than to stdout.
.TP
.BI "-n, --length " INT
Length of FASTQ sequence line.
[60]
.TP
.B -c, --continue
Continue working if a non-existent region is requested.
.TP
.BI "-r, --region-file " FILE
Read regions from a file. Format is chr:from-to, one per line.
.TP
.B -i, --reverse-complement
Output the sequence as the reverse complement.
When this option is used, \*(lq/rc\*(rq will be appended to the sequence names.
To turn this off or change the string appended, use the
.B --mark-strand
option.
.TP
.B     --mark-strand TYPE
Append strand indicator to sequence name.  TYPE can be one of:
.RS
.TP
.B rc
Append '/rc' when writing the reverse complement.  This is the default.
.TP
.B no
Do not append anything.
.TP
.B sign
Append '(+)' for forward strand or '(-)' for reverse complement.  This matches
the output of \*(lqbedtools getfasta -s\*(rq.
.TP
.B custom,<pos>,<neg>
Append string <pos> to names when writing the forward strand and <neg> when
writing the reverse strand.  Spaces are preserved, so it is possible to move
the indicator into the comment part of the description line by including
a leading space in the strings <pos> and <neg>.
.RE
.TP
.B -h, --help
Print help message and exit.
.RE

.TP \"-------- tview
.B tview
samtools tview
.RB [ -p
.IR chr:pos ]
.RB [ -s
.IR STR ]
.RB [ -d
.IR display ]
.RI <in.sorted.bam>
.RI [ref.fasta]

Text alignment viewer (based on the ncurses library). In the viewer,
press `?' for help and press `g' to jump to the alignment starting at a
position given in a format such as `chr10:10,000,000', or `=10,000,000'
when staying on the same reference sequence.

.B Options:
.RS
.TP 14
.BI -d \ display
Output as (H)tml or (C)urses or (T)ext
.TP
.BI -p \ chr:pos
Go directly to this position
.TP
.BI -s \ STR
Display only alignments from this sample or read group
.RE
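.IP
For example, to open the viewer at a given position, or to request a
plain-text rendering of the same position instead of the interactive
display (file names and position are placeholders):
.EX 2
samtools tview -p chr10:10000000 aln.sorted.bam ref.fasta
samtools tview -d T -p chr10:10000000 aln.sorted.bam ref.fasta
.EE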

.TP \"-------- split
.B split
samtools split
.RI [ options ]
.IR merged.sam | merged.bam | merged.cram

Splits a file by read group.

.B Options:
.RS
.TP 14
.BI "-u " FILE1
.RI "Put reads with no RG tag or an unrecognised RG tag into " FILE1
.TP
.BI "-u " FILE1 ":" FILE2
.RI "As above, but assigns an RG tag as given in the header of " FILE2
.TP
.BI "-f " STRING
Output filename format string (see below)
["%*_%#.%."]
.TP
.B -v
Verbose output
.PP
Format string expansions:
.TS
center;
lb l .
%%	%
%*	basename
%#	@RG index
%!	@RG ID
%.	output format filename extension
.TE
.RE
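.IP
As a sketch, the command below names each output file after the input
basename and the @RG ID, and collects reads without a usable RG tag in
a separate file (both file names are placeholders):
.EX 2
samtools split -f '%*_%!.%.' -u unassigned.bam merged.bam
.EE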

.TP \"-------- quickcheck
.B quickcheck
samtools quickcheck
.RI [ options ]
.IR in.sam | in.bam | in.cram
[ ... ]

Quickly check that input files appear to be intact. Checks that the
beginning of the file contains a valid header (all formats) containing at
least one target sequence, then seeks to the end of the file and checks
that an end-of-file (EOF) block is present and intact (BAM only).

Data in the middle of the file is not read since that would be much more time
consuming, so please note that this command will not detect internal corruption,
but is useful for testing that files are not truncated before performing more
intensive tasks on them.

This command will exit with a non-zero exit code if any input files don't have a
valid header or are missing an EOF block. Otherwise it will exit successfully
(with a zero exit code).

.B Options:
.RS
.TP 8
.B -v
Verbose output: will additionally print the names of all input files that don't
pass the check to stdout. Multiple -v options will cause additional messages
regarding check results to be printed to stderr.
.TP 8
.B -q
Quiet mode: disables warning messages on stderr about files that fail.
If both -q and -v options are used then the appropriate level of -v takes precedence.
.RE
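
.B EXAMPLE

Because the result is reported through the exit code, the command fits
naturally into a shell conditional.
The sketch below relies only on the behaviour described above (failing
file names go to stdout with -v); the output file name is a placeholder.
.EX 4
if samtools quickcheck -v *.bam > bad_bams.fofn; then
    echo "all files look intact"
else
    echo "some files failed; their names are in bad_bams.fofn"
fi
.EE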

.TP \"-------- dict
.B dict
samtools dict <ref.fasta|ref.fasta.gz>

Create a sequence dictionary file from a fasta file.

.B OPTIONS:
.RS
.TP 11
.BI -a,\ --assembly \ STR
Specify the assembly for the AS tag.
.TP
.B -H,\ --no-header
Do not print the @HD header line.
.TP
.BI -o,\ --output \ FILE
Output to
.I FILE
[stdout].
.TP
.BI -s,\ --species \ STR
Specify the species for the SP tag.
.TP
.BI -u,\ --uri \ STR
Specify the URI for the UR tag. Defaults to
the absolute path of
.I ref.fasta
unless reading from stdin.
.RE
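
.B EXAMPLE

A sketch of writing a dictionary with the AS and SP tags filled in.
The assembly name, species and file names are illustrative values only.
.EX 4
samtools dict -a GRCh38 -s "Homo sapiens" -o ref.dict ref.fasta
.EE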

.TP \"-------- fixmate
.B fixmate
.na
samtools fixmate
.RB [ -rpcm ]
.RB [ -O
.IR format ]
.I in.nameSrt.bam out.bam
.ad

Fill in mate coordinates, ISIZE and mate related flags from a
name-sorted alignment.

.B OPTIONS:
.RS
.TP 11
.B -r
Remove secondary and unmapped reads.
.TP
.B -p
Disable FR proper pair check.
.TP
.B -c
Add template cigar ct tag.
.TP
.B -m
Add ms (mate score) tags.  These are used by
.B markdup
to select the best reads to keep.
.TP
.BI "-O " FORMAT
Write the final output as
.BR sam ", " bam ", or " cram .

By default, samtools tries to select a format based on the output
filename extension; if output is to standard output or no format can be
deduced,
.B bam
is selected.
.RE

.TP \"-------- mpileup
.B mpileup
samtools mpileup
.RB [ -EB ]
.RB [ -C
.IR capQcoef ]
.RB [ -r
.IR reg ]
.RB [ -f
.IR in.fa ]
.RB [ -l
.IR list ]
.RB [ -Q
.IR minBaseQ ]
.RB [ -q
.IR minMapQ ]
.I in.bam
.RI [ in2.bam
.RI [ ... ]]

Generate pileup for one or multiple BAM files. Alignment records
are grouped by sample (SM) identifiers in @RG header lines. If sample
identifiers are absent, each input file is regarded as one sample.

Samtools mpileup can still produce VCF and BCF output, but this feature is
deprecated and will be removed in a future release.  Please use
.B bcftools mpileup
for this instead.  (Documentation on the deprecated options has been removed
from this manual page, but older versions are available online
at <http://www.htslib.org/doc/>.)

In the pileup format (without
.BR -u \ or \ -g ),
each
line represents a genomic position, consisting of chromosome name,
1-based coordinate, reference base, the number of reads covering the site,
read bases, base qualities and alignment
mapping qualities. Information on match, mismatch, indel, strand,
mapping quality and start and end of a read are all encoded at the read
base column. At this column, a dot stands for a match to the reference
base on the forward strand, a comma for a match on the reverse strand,
a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward
strand and `acgtn' for a mismatch on the reverse strand. A pattern
`\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this
reference position and the next reference position. The length of the
insertion is given by the integer in the pattern, followed by the
inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
represents a deletion from the reference. The deleted bases will be
presented as `*' in the following lines. Also at the read base column, a
symbol `^' marks the start of a read. The ASCII code of the character
following `^' minus 33 gives the mapping quality. A symbol `$' marks the
end of a read segment.
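
For instance, the mapping quality encoded after `^' can be recovered with
ordinary shell arithmetic.
This is only a sketch; the character `I' is an arbitrary example and the
variable names are made up for illustration.
.EX 4
# Character that follows the caret in the read base column (example value)
c=I
# POSIX printf yields the character code when the argument starts with a quote
code=$(printf '%d' "'$c")
# Subtract the offset of 33 to obtain the mapping quality
mapq=$((code - 33))
echo "$mapq"
.EE

With `I' (ASCII code 73) this prints 40.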

Note that there are two orthogonal ways to specify locations in the
input file: via \fB-r\fR \fIregion\fR and \fB-l\fR \fIfile\fR.  The
former uses (and requires) an index to do random access while the
latter streams through the file contents filtering out the specified
regions, requiring no index.  The two may be used in conjunction.  For
example a BED file containing locations of genes in chromosome 20
could be specified using \fB-r 20 -l chr20.bed\fR, meaning that the
index is used to find chromosome 20 and then it is filtered for the
regions listed in the bed file.
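
Written out in full, the combination described above might look like
this (a sketch; the file names are placeholders and in.bam must be
indexed):
.EX 4
samtools mpileup -r 20 -l chr20.bed -f ref.fa -o chr20_genes.pileup in.bam
.EE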

.B Input Options:
.RS
.TP 10
.B -6, --illumina1.3+
Assume the quality is in the Illumina 1.3+ encoding.
.TP
.B -A, --count-orphans
Do not skip anomalous read pairs in variant calling.
.TP
.BI -b,\ --bam-list \ FILE
List of input BAM files, one file per line [null]
.TP
.B -B, --no-BAQ
Disable base alignment quality (BAQ) computation.
See
.B BAQ
below.
.TP
.BI -C,\ --adjust-MQ \ INT
Coefficient for downgrading mapping quality for reads containing
excessive mismatches. Given a read with a phred-scaled probability q of
being generated from the mapped position, the new mapping quality is
about sqrt((INT-q)/INT)*INT. A zero value disables this
functionality; if enabled, the recommended value for BWA is 50. [0]
.TP
.BI -d,\ --max-depth \ INT
At a position, read at most
.I INT
reads per input file. Setting this limit reduces the amount of memory and
time needed to process regions with very high coverage.  Passing zero for this
option sets it to the highest possible value, effectively removing the depth
limit. [8000]

Note that up to release 1.8, samtools would enforce a minimum value for
this option.  This no longer happens and the limit is set exactly as
specified.
.TP
.B -E, --redo-BAQ
Recalculate BAQ on the fly, ignore existing BQ tags.
See
.B BAQ
below.
.TP
.BI -f,\ --fasta-ref \ FILE
The
.BR faidx -indexed
reference file in the FASTA format. The file can be optionally compressed by
.BR bgzip .
[null]

Supplying a reference file will enable base alignment quality calculation
for all reads aligned to a reference in the file.  See
.B BAQ
below.
.TP
.BI -G,\ --exclude-RG \ FILE
Exclude reads from read groups listed in FILE (one @RG-ID per line)
.TP
.BI -l,\ --positions \ FILE
BED or position list file containing a list of regions or sites where
pileup or BCF should be generated. Position list files contain two
columns (chromosome and position) and start counting from 1.  BED
files contain at least 3 columns (chromosome, start and end position)
and are 0-based half-open.
.br
While it is possible to mix both position-list and BED coordinates in
the same file, this is strongly discouraged due to the differing
coordinate systems. [null]
.TP
.BI -q,\ --min-MQ \ INT
Minimum mapping quality for an alignment to be used [0]
.TP
.BI -Q,\ --min-BQ \ INT
Minimum base quality for a base to be considered [13]
.TP
.BI -r,\ --region \ STR
Only generate pileup in region
.IR STR .
Requires the BAM files to be indexed.
If used in conjunction with -l then considers the intersection of the
two requests.
[all sites]
.TP
.B -R,\ --ignore-RG
Ignore RG tags. Treat all reads in one BAM as one sample.
.TP
.BI --rf,\ --incl-flags \ STR|INT
Required flags: skip reads with mask bits unset [null]
.TP
.BI --ff,\ --excl-flags \ STR|INT
Filter flags: skip reads with mask bits set
[UNMAP,SECONDARY,QCFAIL,DUP]
.TP
.B -x,\ --ignore-overlaps
Disable read-pair overlap detection.
.PP
.B Output Options:
.TP 10
.BI "-o, --output " FILE
Write pileup output to
.IR FILE ,
rather than the default of standard output.

(The same short option is used for both the deprecated
.BR --open-prob
option and
.BR --output .
If
.BR -o 's
argument contains any non-digit characters other than a leading + or - sign,
it is interpreted as
.BR --output .
Usually the filename extension will take care of this, but to write to an
entirely numeric filename use
.B -o ./123
or
.BR "--output 123" .)
.TP
.B -O, --output-BP
Output base positions on reads.
.TP
.B -s, --output-MQ
Output mapping quality.
.TP
.B --output-QNAME
Output an extra column containing comma-separated read names.
.TP
.B -a
Output all positions, including those with zero depth.
.TP
.B -a -a, -aa
Output absolutely all positions, including unused reference sequences.
Note that when used in conjunction with a BED file the -a option may
sometimes operate as if -aa was specified if the reference sequence
has coverage outside of the region specified in the BED file.
.PP
.B BAQ (Base Alignment Quality)
.PP
BAQ is the Phred-scaled probability of a read base being misaligned.
It greatly helps to reduce false SNPs caused by misalignments.
BAQ is calculated using the probabilistic realignment method described
in the paper \*(lqImproving SNP discovery by base alignment quality\*(rq,
Heng Li, Bioinformatics, Volume 27, Issue 8
<https://doi.org/10.1093/bioinformatics/btr076>

BAQ is turned on when a reference file is supplied using the
.B -f
option.  To disable it, use the
.B -B
option.

It is possible to store pre-calculated BAQ values in a SAM BQ:Z tag.
Samtools mpileup will use the precalculated values if it finds them.
The
.B -E
option can be used to make it ignore the contents of the BQ:Z tag and
force it to recalculate the BAQ scores by making a new alignment.
.RE

.TP \"-------- flags
.B flags
samtools flags INT|STR[,...]

Convert between textual and numeric flag representation.

.B FLAGS:
.TS
rb l l .
0x1	PAIRED	paired-end (or multiple-segment) sequencing technology
0x2	PROPER_PAIR	each segment properly aligned according to the aligner
0x4	UNMAP	segment unmapped
0x8	MUNMAP	next segment in the template unmapped
0x10	REVERSE	SEQ is reverse complemented
0x20	MREVERSE	SEQ of the next segment in the template is reverse complemented
0x40	READ1	the first segment in the template
0x80	READ2	the last segment in the template
0x100	SECONDARY	secondary alignment
0x200	QCFAIL	not passing quality controls
0x400	DUP	PCR or optical duplicate
0x800	SUPPLEMENTARY	supplementary alignment
.TE
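
Because each flag is a single bit, masks are formed by OR-ing values
together.
The sketch below uses plain shell arithmetic to build the mask for
SECONDARY and SUPPLEMENTARY; the variable name is arbitrary.
.EX 4
# SECONDARY (0x100) combined with SUPPLEMENTARY (0x800)
mask=$((0x100 | 0x800))
printf '0x%x' "$mask"
echo
.EE

This prints 0x900, the value passed to -F in the
.B fastq
examples in this manual.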

.TP \"-------- fastq fasta
.B fastq/a
samtools fastq
.RI [ options ]
.I in.bam
.br
samtools fasta
.RI [ options ]
.I in.bam

Converts a BAM or CRAM into either FASTQ or FASTA format depending on the
command invoked. The files will be automatically compressed if the
file names have a .gz or .bgzf extension.

The input to this program must be collated by name.
Use
.B samtools collate
or
.B samtools sort -n
to ensure this.

For each different QNAME, the input records are categorised according to
the state of the READ1 and READ2 flag bits.
The three categories used are:

1 : Only READ1 is set.

2 : Only READ2 is set.

0 : Either both READ1 and READ2 are set; or neither is set.

The exact meaning of these categories depends on the sequencing technology
used.
It is expected that ordinary single and paired-end sequencing reads will be
in categories 1 and 2 (in the case of paired-end reads, one read of the pair
will be in category 1, the other in category 2).
Category 0 is essentially a \*(lqcatch-all\*(rq for reads that do not
fit into a simple paired-end sequencing model.

For each category only one sequence will be written for a given QNAME.
If more than one record is available for a given QNAME and category,
the first in input file order that has quality values will be used.
If none of the candidate records has quality values, then the first in
input file order will be used instead.

Sequences will be written to standard output unless one of the
.BR -1 , -2 ", or " -0
options is used, in which case sequences for that category will be written to
the specified file.

If a singleton file is specified using the
.B -s
option then only paired sequences will be output for categories 1 and 2;
paired meaning that for a given QNAME there are sequences for both
category 1
.B and
2.
If there is a sequence for only one of categories 1 or 2 then it will be
diverted into the specified singletons file.
This can be used to prepare fastq files for programs that cannot handle
a mixture of paired and singleton reads.

The
.B -s
option only affects category 1 and 2 records.
The output for category 0 will be the same irrespective of the use of this
option.

.B OPTIONS:
.RS
.TP 8
.B -n
By default, either '/1' or '/2' is added to the end of read names
where the corresponding READ1 or READ2 FLAG bit is set.
Using
.B -n
causes read names to be left as they are.
.TP 8
.B -N
Always add either '/1' or '/2' to the end of read names
even when put into different files.
.TP 8
.B -O
Use quality values from OQ tags in preference to standard quality string
if available.
.TP 8
.B -s FILE
Write singleton reads to FILE.
.TP 8
.B -t
Copy RG, BC and QT tags to the FASTQ header line, if they exist.
.TP 8
.B -T TAGLIST
Specify a comma-separated list of tags to copy to the FASTQ header line, if
they exist.
.TP 8
.B -1 FILE
Write reads with the READ1 FLAG set (and READ2 not set) to FILE instead of
outputting them.
If the
.B -s
option is used, only paired reads will be written to this file.
.TP 8
.B -2 FILE
Write reads with the READ2 FLAG set (and READ1 not set) to FILE instead of
outputting them.
If the
.B -s
option is used, only paired reads will be written to this file.
.TP 8
.B -0 FILE
Write reads where the READ1 and READ2 FLAG bits are either both set
or both unset to FILE instead of outputting them.
.TP 8
.BI "-f " INT
Only output alignments with all bits set in
.I INT
present in the FLAG field.
.I INT
can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
.TP 8
.BI "-F " INT
Do not output alignments with any bits set in
.I INT
present in the FLAG field.
.I INT
can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
.TP 8
.BI "-G " INT
Only EXCLUDE reads with all of the bits set in
.I INT
present in the FLAG field.
.I INT
can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
.TP 8
.B -i
Add Illumina Casava 1.8 format entry to header (e.g. 1:N:0:ATCACG).
.TP 8
.B -c [0..9]
Set compression level when writing gz or bgzf fastq files.
.TP 8
.B --i1 FILE
Write first index reads to FILE.
.TP 8
.B --i2 FILE
Write second index reads to FILE.
.TP 8
.B --barcode-tag TAG
Aux tag to find index reads in [default: BC].
.TP 8
.B --quality-tag TAG
Aux tag to find index quality in [default: QT].
.TP 8
.B --index-format STR
String to describe how to parse the barcode and quality tags. For example:

.RS
.TP 8
.B i14i8
the first 14 characters are index 1, the next 8 characters are index 2
.TP 8
.B n8i14
ignore the first 8 characters, and use the next 14 characters for index 1

If the tag contains a separator, then the numeric part can be replaced with '*' to
mean 'read until the separator or end of tag', for example:
.TP 8
.B n*i*
ignore the left part of the tag until the separator, then use the second part
.RE

.B EXAMPLES

Output paired reads to separate files, discarding singletons, supplementary
and secondary reads.
The resulting files can be used with, for example, the
.B bwa
aligner.
.EX 4
samtools fastq -1 paired1.fq -2 paired2.fq -0 /dev/null -s /dev/null -n -F 0x900 in.bam
.EE

Output paired and singleton reads in a single file, discarding supplementary
and secondary reads.
To get all of the reads in a single file, it is necessary to redirect the
output of samtools fastq.
The output file is suitable for use with
.B bwa mem -p
which understands interleaved files containing a mixture of paired and
singleton reads.
.EX 4
samtools fastq -0 /dev/null -F 0x900 in.bam > all_reads.fq
.EE

Output paired reads in a single file, discarding supplementary and
secondary reads.
Save any singletons in a separate file.
Append /1 and /2 to read names.
This format is suitable for use by
.B NextGenMap
when using its
.BR -p " and " -q " options."
With this aligner, paired reads must be mapped separately to the singletons.
.EX 4
samtools fastq -0 /dev/null -s single.fq -N -F 0x900 in.bam > paired.fq
.EE

.B BUGS

.IP o 2
The way of specifying output files is far too complicated and easy to get wrong.

.IP o 2
The default value for the -F option should really be 0x900 so that secondary
and supplementary reads are automatically excluded.
The existing default of 0 is retained for reasons of compatibility.

.RE

.TP \"-------- collate
.B collate
samtools collate
.RI [ options ]
.IR in.sam | in.bam | in.cram " [" <prefix> "]"

Shuffles and groups reads together by their names.
A faster alternative to a full query name sort,
.B collate
ensures that reads of the same name are grouped together in contiguous groups,
but doesn't make any guarantees about the order of read names between groups.

The output from this command should be suitable for any operation that
requires all reads from the same template to be grouped together.

If present, <prefix> is used to name the temporary files that collate
uses when sorting the data.  If neither the '-O' nor '-o' options are used,
<prefix> must be present and collate will use it to make an output file name
by appending a suffix depending on the format written (.bam by default).

If either the -O or -o option is used, <prefix> is optional.  If <prefix>
is absent, collate will write the temporary files to a system-dependent
location (/tmp on UNIX).

Using -f for fast mode will output \fBonly\fR primary alignments that have
either the READ1 \fBor\fR READ2 flags set (but not both).
Any other alignment records will be filtered out.
The collation will only work correctly if there are no more than two reads
for any given QNAME after filtering.

Fast mode keeps a buffer of alignments in memory so that it can write out
most pairs as soon as they are found instead of storing them in temporary
files.
This allows collate to avoid some work and so finish more quickly compared
to the standard mode.
The number of alignments held can be changed using -r; storing more alignments
uses more memory but increases the number of pairs that can be written early.

While collate normally randomises the ordering of read pairs, fast mode
does not.
Position-dependent biases that would normally be broken up can remain in the
fast collate output.
It is therefore not a good idea to use fast mode when preparing data for
programs that expect randomly ordered paired reads.
For example using fast collate instead of the standard mode may lead to
significantly different results from aligners that estimate library insert
sizes on batches of reads.

.B Options:
.RS
.TP 8
.B -O
Output to stdout.  This option cannot be used with '-o'.
.TP
.B -o FILE
Write output to FILE.  This option cannot be used with '-O'.
.TP
.B -u
Write uncompressed BAM output
.TP
.BI "-l "  INT
Compression level.
[1]
.TP
.BI "-n " INT
Number of temporary files to use.
[64]
.TP
.B -f
Fast mode (primary alignments only).
.TP
.BI "-r " INT
Number of reads to store in memory (for use with -f).
[10000]
.RE
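
.B EXAMPLE

A sketch of one common use, assuming the aim is to convert a
coordinate-sorted file to FASTQ: collate writes uncompressed BAM to
standard output and
.B samtools fastq
reads it from standard input.
The file names and the temporary-file prefix are placeholders.
.EX 4
samtools collate -u -O in.bam tmp_collate | samtools fastq -1 r1.fq -2 r2.fq -0 /dev/null -s /dev/null -n -
.EE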

.TP \"-------- reheader
.B reheader
samtools reheader
.RB [ -iP ]
.I in.header.sam in.bam

Replace the header in
.I in.bam
with the header in
.IR in.header.sam .
This command is much faster than replacing the header with a
BAM\(->SAM\(->BAM conversion.

By default this command outputs the BAM or CRAM file to standard
output (stdout), but for CRAM format files it has the option to
perform an in-place edit, both reading and writing to the same file.
No validity checking is performed on the header, nor that it is suitable
to use with the sequence data itself.

.B OPTIONS:
.RS
.TP 8
.B -P, --no-PG
Do not generate an @PG header line.
.TP 8
.B -i, --in-place
Perform the header edit in-place, if possible.  This only works on CRAM
files and only if there is sufficient room to store the new header.
The amount of space available will differ for each CRAM file.
.RE
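
.B EXAMPLE

A sketch of the usual edit cycle: dump the existing header with
.BR "samtools view -H" ,
modify it, then write a new BAM carrying the edited header.
The file names are placeholders.
.EX 4
samtools view -H in.bam > header.sam
# edit header.sam with a text editor or a script
samtools reheader header.sam in.bam > out.bam
.EE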

.TP \"-------- cat
.B cat
samtools cat [-b list] [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ]

Concatenate BAMs or CRAMs. Although this works on either BAM or CRAM,
all input files must be the same format as each other. The sequence
dictionary of each input file must be identical, although this command
does not check this. This command uses a similar trick to
.B reheader
which enables fast BAM concatenation.

.B OPTIONS:
.RS
.TP 8
.BI "-b " FOFN
Read the list of input BAM or CRAM files from \fIFOFN\fR.  These are
concatenated prior to any files specified on the command line.
Multiple \fB-b\fR \fIFOFN\fR options may be specified to concatenate
multiple lists of BAM/CRAM files.
.TP 8
.BI "-h " FILE
Uses the SAM header from \fIFILE\fR.  By default the header is taken
from the first file to be concatenated.
.TP 8
.BI "-o " FILE
Write the concatenated output to \fIFILE\fR.  By default this is sent
to stdout.
.RE

.TP \"-------- rmdup
.B rmdup
samtools rmdup [-sS] <input.srt.bam> <out.bam>

.B This command is obsolete.  Use markdup instead.

Remove potential PCR duplicates: if multiple read pairs have identical
external coordinates, only retain the pair with highest mapping quality.
In the paired-end mode, this command
.B ONLY
works with FR orientation and requires ISIZE to be correctly set. It does
not work for unpaired reads (e.g. two ends mapped to different
chromosomes or orphan reads).

.B OPTIONS:
.RS
.TP 8
.B -s
Remove duplicates for single-end reads. By default, the command works for
paired-end reads only.
.TP 8
.B -S
Treat paired-end reads as single-end reads.
.RE

.TP \"-------- addreplacerg
.B addreplacerg
samtools addreplacerg [-r rg line | -R rg ID] [-m mode] [-l level] [-o out.bam]
<input.bam>

Adds or replaces read group tags in a file.

.B OPTIONS:
.RS
.TP 8
.BI "-r " STRING
Allows you to specify a read group line to append to the header and applies it
to the reads specified by the -m option. If repeated it automatically adds in
tabs between invocations.
.TP 8
.BI "-R " STRING
Allows you to specify the read group ID of an existing @RG line and applies it
to the reads specified.
.TP 8
.BI "-m " MODE
If you choose orphan_only then existing RG tags are not overwritten; if you choose
overwrite_all, existing RG tags are overwritten. The default is overwrite_all.
.TP 8
.BI "-o " STRING
Write the final output to STRING. The default is to write to stdout.

By default, samtools tries to select a format based on the output
filename extension; if output is to standard output or no format can be
deduced,
.B bam
is selected.
.RE
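
.B EXAMPLE

A sketch that builds a read group line from repeated
.B -r
options, relying on the tab-joining behaviour described above, and
applies it only to reads that have no RG tag yet.
The read group ID, sample name and file names are hypothetical.
.EX 4
samtools addreplacerg -r '@RG' -r 'ID:grp1' -r 'SM:sample1' -m orphan_only -o out.bam in.bam
.EE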

.TP \"-------- calmd
.B calmd
samtools calmd [-Eeubr] [-C capQcoef] <aln.bam> <ref.fasta>

Generate the MD tag. If the MD tag is already present, this command will
give a warning if the MD tag generated is different from the existing
tag. Output SAM by default.

Calmd can also read and write CRAM files although in most cases it is
pointless as CRAM recalculates MD and NM tags on the fly.  The one
exception to this case is where both input and output CRAM files
have been / are being created with the \fIno_ref\fR option.

.B OPTIONS:
.RS
.TP 8
.B -A
When used jointly with
.B -r
this option overwrites the original base quality.
.TP 8
.B -e
Convert a read base to = if it is identical to the aligned reference
base. Indel caller does not support the = bases at the moment.
.TP
.B -u
Output uncompressed BAM
.TP
.B -b
Output compressed BAM
.TP
.BI -C \ INT
Coefficient to cap mapping quality of poorly mapped reads. See the
.B mpileup
command for details. [0]
.TP
.B -r
Compute the BQ tag (without -A) or cap base quality by BAQ (with -A).
.TP
.B -E
Extended BAQ calculation. This option trades specificity for sensitivity, though the
effect is minor.
.RE

.TP \"-------- targetcut
.B targetcut
samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam>

This command identifies target regions by examining the continuity of read depth, computes
haploid consensus sequences of targets and outputs a SAM with each sequence corresponding
to a target. When option
.B -f
is in use, BAQ will be applied. This command is
.B only
designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)].

.TP \"-------- phase
.B phase
samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam>

Call and phase heterozygous SNPs.

.B OPTIONS:
.RS
.TP 8
.B -A
Drop reads with ambiguous phase.
.TP 8
.BI -b \ STR
Prefix of BAM output. When this option is in use, phase-0 reads will be saved in file
.BR STR .0.bam
and phase-1 reads in
.BR STR .1.bam.
Phase unknown reads will be randomly allocated to one of the two files. Chimeric reads
with switch errors will be saved in
.BR STR .chimeric.bam.
[null]
.TP
.B -F
Do not attempt to fix chimeric reads.
.TP
.BI -k \ INT
Maximum length for local phasing. [13]
.TP
.BI -q \ INT
Minimum Phred-scaled LOD to call a heterozygote. [40]
.TP
.BI -Q \ INT
Minimum base quality to be used in het calling. [13]
.RE

.TP \"-------- depad
.B depad
samtools depad [-SsCu1] [-T ref.fa] [-o output] <in.bam>

Converts a BAM aligned against a padded reference to a BAM aligned
against the depadded reference.  The padded reference may contain
verbatim "*" bases in it, but "*" bases are also counted in the
reference numbering.  This means that a sequence base-call aligned
against a reference "*" is considered to be a cigar match ("M" or "X")
operator (if the base-call is "A", "C", "G" or "T").  After depadding
the reference "*" bases are deleted and such aligned sequence
base-calls become insertions.  Similar transformations apply for
deletions and padding cigar operations.

.B OPTIONS:
.RS
.TP
.B -S
Ignored for compatibility with previous samtools versions.
Previously this option was required if input was in SAM format, but now the
correct format is automatically detected by examining the first few characters
of input.
.TP
.B -s
Output in SAM format.  The default is BAM.
.TP
.B -C
Output in CRAM format.  The default is BAM.
.TP
.B -u
Do not compress the output.  Applies to either BAM or CRAM output
format.
.TP
.B -1
Enable fastest compression level.  Only works for BAM or CRAM output.
.TP
.BI "-T " FILE
Provides the padded reference file.  Note that without this the @SQ
line lengths will be incorrect, so for most use cases this option should
be considered mandatory.
.TP
.BI "-o " FILE
Specifies the output filename.  By default output is sent to stdout.
.RE

.TP \"-------- markdup
.B markdup
.na
samtools markdup
.RB [ -l
.IR length ]
.RB [ -r ]
.RB [ -s ]
.RB [ -T
.IR PREFIX ]
.RB [ -S ]
.I in.algsort.bam out.bam
.ad

Mark duplicate alignments from a coordinate sorted file that
has been run through fixmate with the -m option.  This program
relies on the MC and ms tags that fixmate provides.

.B OPTIONS:
.RS
.TP 11
.BI "-l " INT
.RI "Expected maximum read length of " INT " bases."
[300]
.TP
.B -r
Remove duplicate reads.
.TP
.B -s
Print some basic stats.
.TP
.BI "-T " PREFIX
Write temporary files to
.IB PREFIX . samtools . nnnn . mmmm . tmp
.TP
.B -S
Mark supplementary reads of duplicates as duplicates.
.RE

.B EXAMPLE
.EX 4
# The first sort can be omitted if the file is already name ordered
samtools sort -n -o namesort.bam example.bam

# Add ms and MC tags for markdup to use later
samtools fixmate -m namesort.bam fixmate.bam

# Markdup needs position order
samtools sort -o positionsort.bam fixmate.bam

# Finally mark duplicates
samtools markdup positionsort.bam markdup.bam
.EE
.TP \"-------- help etc
.BR help ,\  --help
Display a brief usage message listing the samtools commands available.
If the name of a command is also given, e.g.,
.BR samtools\ help\ view ,
the detailed usage message for that particular command is displayed.

.TP
.B --version
Display the version numbers and copyright information for samtools and
the important libraries used by samtools.

.TP
.B --version-only
Display the full samtools version number in a machine-readable format.
.PP
.SH GLOBAL OPTIONS
.PP
Several long options are shared between multiple samtools subcommands:
\fB--input-fmt\fR, \fB--input-fmt-option\fR, \fB--output-fmt\fR,
\fB--output-fmt-option\fR, and \fB--reference\fR.
The input format is typically auto-detected so specifying the format
is usually unnecessary and the option is included for completeness.
Note that not all subcommands have all options.  Consult the subcommand
help for more details.
.PP
Format strings recognised are "sam", "bam" and "cram".  They may be
followed by a comma separated list of options as \fIkey\fR or
\fIkey\fR=\fIvalue\fR. See below for examples.
.PP
The \fBfmt-option\fR arguments accept either a single \fIoption\fR or
\fIoption\fR=\fIvalue\fR.  Note that some options only work on some
file formats and only on read or write streams.  If value is
unspecified for a boolean option, the value is assumed to be 1.  The
valid options are as follows.
.RS 0
.\" General purpose
.TP 4
.BI level= INT
Output only. Specifies the compression level from 1 to 9, or 0 for
uncompressed.
.TP
.BI nthreads= INT
Specifies the number of threads to use during encoding and/or
decoding.  For BAM this will be encoding only.  In CRAM the threads
are dynamically shared between encoder and decoder.
.\" CRAM specific
.TP
.BI reference= fasta_file
Specifies a FASTA reference file for use in CRAM encoding or decoding.
It usually is not required for decoding except in the situation of the
MD5 not being obtainable via the REF_PATH or REF_CACHE environment variables.
.TP
.BI decode_md= 0|1
CRAM input only; defaults to 1 (on).  CRAM does not typically store
MD and NM tags, preferring to generate them on the fly.  This option
controls this behaviour.  It can be particularly useful when combined
with a file encoded using store_md=1 and store_nm=1.
.TP
.BI store_md= 0|1
CRAM output only; defaults to 0 (off).  CRAM normally only stores MD
tags when the reference is unknown and lets the decoder generate these
values on-the-fly (see decode_md).
.TP
.BI store_nm= 0|1
CRAM output only; defaults to 0 (off).  CRAM normally only stores NM
tags when the reference is unknown and lets the decoder generate these
values on-the-fly (see decode_md).
.TP
.BI ignore_md5= 0|1
CRAM input only; defaults to 0 (off).  When enabled, md5 checksum
errors on the reference sequence and block checksum errors within CRAM
are ignored.  Use of this option is strongly discouraged.
.TP
.BI required_fields= bit-field
CRAM input only; specifies which SAM columns need to be populated.
By default all fields are used.  Limiting the decode to specific
columns can have significant performance gains.  The bit-field is a
numerical value constructed from the following table.
.TS
center;
rb l .
0x1	SAM_QNAME
0x2	SAM_FLAG
0x4	SAM_RNAME
0x8	SAM_POS
0x10	SAM_MAPQ
0x20	SAM_CIGAR
0x40	SAM_RNEXT
0x80	SAM_PNEXT
0x100	SAM_TLEN
0x200	SAM_SEQ
0x400	SAM_QUAL
0x800	SAM_AUX
0x1000	SAM_RGAUX
.TE
.TP
.BI name_prefix= string
CRAM input only; defaults to output filename.  Any sequences with
auto-generated read names will use \fIstring\fR as the name prefix.
.TP
.BI multi_seq_per_slice= 0|1
CRAM output only; defaults to 0 (off).  By default CRAM generates one
container per reference sequence, except in the case of many small
references (such as a fragmented assembly).
.TP
.BI version= major.minor
CRAM output only.  Specifies the CRAM version number.  Acceptable
values are "2.1" and "3.0".
.TP
.BI seqs_per_slice= INT
CRAM output only; defaults to 10000.
.TP
.BI slices_per_container= INT
CRAM output only; defaults to 1.  The effect of having multiple slices
per container is to share the compression header block between
multiple slices.  This is unlikely to have any significant impact
unless the number of sequences per slice is reduced.  (Together these
two options control the granularity of random access.)
.TP
.BI embed_ref= 0|1
CRAM output only; defaults to 0 (off).  If 1, this will store portions
of the reference sequence in each slice, permitting decode without
requiring an external copy of the reference sequence.
.TP
.BI no_ref= 0|1
CRAM output only; defaults to 0 (off).  If 1, sequences will be stored
verbatim with no reference encoding.  This can be useful if no
reference is available for the file.
.TP
.BI use_bzip2= 0|1
CRAM output only; defaults to 0 (off).  Permits use of bzip2 in CRAM
block compression.
.TP
.BI use_lzma= 0|1
CRAM output only; defaults to 0 (off).  Permits use of lzma in CRAM
block compression.
.TP
.BI lossy_names= 0|1
CRAM output only; defaults to 0 (off).  If 1, templates with all
members within the same CRAM slice will have their read names
removed.  New names will be automatically generated during decoding.
Also see the \fBname_prefix\fR option.
.RE
.PP
For example:
.EX 4
samtools view --input-fmt-option decode_md=0 \\
    --output-fmt cram,version=3.0 --output-fmt-option embed_ref \\
    --output-fmt-option seqs_per_slice=2000 -o foo.cram foo.bam
.EE
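.PP
The \fBrequired_fields\fR bit-field can be assembled with shell
arithmetic.  The following is a minimal sketch; the choice of QNAME,
FLAG and SEQ is only an illustrative assumption:
.EX 4
# SAM_QNAME (0x1) | SAM_FLAG (0x2) | SAM_SEQ (0x200)
fields=$(printf '0x%x' $(( 0x1 | 0x2 | 0x200 )))
echo "required_fields=$fields"
.EE
.PP
The resulting value (0x203 here) could then be supplied via
\fB--input-fmt-option required_fields=0x203\fR when reading a CRAM file.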
.PP
.SH REFERENCE SEQUENCES
.PP
The CRAM format requires use of a reference sequence for both reading
and writing.
.PP
When reading a CRAM the \fB@SQ\fR headers are interrogated to identify
the reference sequence MD5sum (\fBM5:\fR tag) and the local reference
sequence filename (\fBUR:\fR tag).  Note that \fIhttp://\fR and
\fIftp://\fR based URLs in the UR: field are not used, but local fasta
filenames (with or without \fIfile://\fR) can be used.
.PP
To create a CRAM the \fB@SQ\fR headers will also be read to identify
the reference sequences, but the M5: and UR: tags may be absent. In
this case the \fB-T\fR and \fB-t\fR options of samtools view may be
used to specify the fasta or fasta.fai filenames respectively
(provided the .fasta.fai file is also backed up by a .fasta file).
.PP
The search order to obtain a reference is:
.IP 1. 3
Use any local file specified by the command line options (e.g. \fB-T\fR).
.IP 2. 3
Look for the MD5 via the REF_CACHE environment variable.
.IP 3. 3
Look for MD5 in each element of the REF_PATH environment variable.
.IP 4. 3
Look for a local file listed in the UR: header tag.
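.PP
As an illustration of this order, a setup along the following lines
consults a local cache before falling back to the EBI server.  The
directory \fI/local/ref_cache\fR and the file names are assumptions,
not defaults:
.EX 4
# Steps 2 and 3: local cache first, then the remote MD5 server.
export REF_CACHE=/local/ref_cache/%2s/%2s/%s
export REF_PATH=http://www.ebi.ac.uk/ena/cram/md5/%s
# Step 1: an explicit reference takes precedence for one command, e.g.
#   samtools view -T ref.fa -b -o aln.bam aln.cram
.EE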
.PP
.SH ENVIRONMENT VARIABLES
.PP
.TP
.B HTS_PATH
A colon-separated list of directories in which to search for HTSlib plugins.
If $HTS_PATH starts or ends with a colon or contains a double colon (\fB::\fP),
the built-in list of directories is searched at that point in the search.

If no HTS_PATH variable is defined, the built-in list of directories
specified when HTSlib was built is used, which typically includes
\fB/usr/local/libexec/htslib\fP and similar directories.

.TP
.B REF_PATH
A colon-separated (semicolon-separated on Windows) list of locations in which
to look for sequences identified by their MD5sums.  This can be either
a list of directories or URLs. Note that if a URL is included then the
colon in http:// and ftp:// and the optional port number will be
treated as part of the URL and not a PATH field separator.
For URLs, the text \fB%s\fR will be replaced by the MD5sum being
read.

If no REF_PATH has been specified it will default to
\fBhttp://www.ebi.ac.uk/ena/cram/md5/%s\fR and, if REF_CACHE is also unset,
REF_CACHE will be set to \fB$XDG_CACHE_HOME/hts-ref/%2s/%2s/%s\fR.
If \fB$XDG_CACHE_HOME\fR is unset, \fB$HOME/.cache\fR (or a local system
temporary directory if no home directory is found) will be used in its place.

.TP
.B REF_CACHE
This can be set to a single directory housing a local cache of
references.  Upon downloading a reference it will be stored in the
location pointed to by REF_CACHE.  When reading a reference it will be
looked for in this directory before searching REF_PATH.  To avoid many
files being stored in the same directory, a pathname may be
constructed using %\fInum\fRs and %s notation, with each %\fInum\fRs
consuming \fInum\fR characters of the MD5sum and %s taking the
remainder.  For example
\fB/local/ref_cache/%2s/%2s/%s\fR will create two levels of nested
subdirectories, with the filenames in the deepest directory being the
last 28 characters of the MD5sum.

The REF_CACHE directory will be searched before attempting to load
via the REF_PATH search list.  If no REF_PATH is defined, both
REF_PATH and REF_CACHE will be automatically set (see above), but if
REF_PATH is defined and REF_CACHE is not, then no local cache is used.

To aid population of the REF_CACHE directory a script
\fBmisc/seq_cache_populate.pl\fR is provided in the Samtools
distribution. This takes a fasta file or a directory of fasta files
and generates the MD5sum named files.
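.IP
The way such a pattern consumes the MD5sum can be sketched in shell as
follows.  The MD5sum shown is a made-up placeholder rather than that of
a real reference:
.EX 2
# Hypothetical 32-character MD5sum.
md5=0123456789abcdef0123456789abcdef
# Each %2s consumes two characters; %s takes the remaining 28.
d1=$(printf '%.2s' "$md5")
rest=${md5#??}
d2=$(printf '%.2s' "$rest")
file=${rest#??}
echo "/local/ref_cache/$d1/$d2/$file"
.EE
For this placeholder the sketch prints
\fI/local/ref_cache/01/23/456789abcdef0123456789abcdef\fR.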
.PP
.SH EXAMPLES
.IP o 2
Import SAM to BAM when
.B @SQ
lines are present in the header:
.EX 2
samtools view -bS aln.sam > aln.bam
.EE
If
.B @SQ
lines are absent:
.EX 2
samtools faidx ref.fa
samtools view -bt ref.fa.fai aln.sam > aln.bam
.EE
where
.I ref.fa.fai
is generated automatically by the
.B faidx
command.

.IP o 2
Convert a BAM file to a CRAM file using a local reference sequence.
.EX 2
samtools view -C -T ref.fa aln.bam > aln.cram
.EE
.IP o 2
Attach the
.B RG
tag while merging sorted alignments:
.EX 2
printf '@RG\\tID:ga\\tSM:hs\\tLB:ga\\tPL:Illumina\\n@RG\\tID:454\\tSM:hs\\tLB:454\\tPL:454\\n' > rg.txt
samtools merge -rh rg.txt merged.bam ga.bam 454.bam
.EE
The value of the
.B RG
tag is determined by the name of the file each read comes from. In this
example, in
.IR merged.bam ,
reads from
.I ga.bam
are tagged with
.IR RG:Z:ga ,
while reads from
.I 454.bam
are tagged with
.IR RG:Z:454 .

.IP o 2
Convert a BAM file to a CRAM with NM and MD tags stored verbatim
rather than calculated on the fly during CRAM decode, so that mixed
data sets with MD/NM only on some records, or NM calculated using
different definitions of mismatch, can be decoded without change.  The
second command demonstrates how to decode such a file.  The request not
to decode MD here turns off auto-generation of both MD and NM;
the MD/NM tags are still emitted on records that had them stored
verbatim.
.EX 2
samtools view -C --output-fmt-option store_md=1 --output-fmt-option store_nm=1 -o aln.cram aln.bam
samtools view --input-fmt-option decode_md=0 -o aln.new.bam aln.cram
.EE
.IP o 2
An alternative way of achieving the above is listing multiple options
after the \fB--output-fmt\fR or \fB-O\fR option.  The commands below
are equivalent to the two above.
.EX 2
samtools view -O cram,store_md=1,store_nm=1 -o aln.cram aln.bam
samtools view --input-fmt cram,decode_md=0 -o aln.new.bam aln.cram
.EE

.IP o 2
Call SNPs and short INDELs:
.EX 2
samtools mpileup -uf ref.fa aln.bam | bcftools call -mv > var.raw.vcf
bcftools filter -s LowQual -e '%QUAL<20 || DP>100' var.raw.vcf  > var.flt.vcf
.EE
The
.B bcftools filter
command marks low quality sites and sites with the read depth exceeding
a limit, which should be adjusted to about twice the average read depth
(bigger read depths usually indicate problematic regions which are
often enriched for artefacts).  Consider adding
.B -C50
to
.B mpileup
if mapping quality is overestimated for reads containing excessive
mismatches. Applying this option usually helps
.B BWA-short
but may not help other mappers.

Individuals are identified from the
.B SM
tags in the
.B @RG
header lines. Individuals can be pooled in one alignment file; one
individual can also be separated into multiple files. Passing
.B -P ILLUMINA
to
.B mpileup
specifies that indel candidates should be collected only from
read groups with the
.B @RG-PL
tag set to
.IR ILLUMINA .
Collecting indel candidates from reads sequenced by an indel-prone
technology may affect the performance of indel calling.

.IP o 2
Generate the consensus sequence for one diploid individual:
.EX 2
samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
.EE
.IP o 2
Phase one individual:
.EX 2
samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out
.EE
The
.B calmd
command is used to reduce false heterozygotes around INDELs.


.IP o 2
Dump BAQ-applied alignment for other SNP callers:
.EX 2
samtools calmd -bAr aln.bam ref.fa > aln.baq.bam
.EE
It adds and corrects the
.B NM
and
.B MD
tags at the same time. The
.B calmd
command also comes with the
.B -C
option, the same as the one in
.B pileup
and
.BR mpileup .
Apply if it helps.
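.PP
The depth limit in the SNP-calling example above is described as about
twice the average read depth.  One way to estimate it is sketched
below, assuming the three-column output of \fBsamtools depth\fR
(the function name is hypothetical, and positions with no coverage are
not reported by default, so the mean is taken over covered positions):
.EX 2
# Read "samtools depth" output on stdin; print twice the mean depth.
twice_mean_depth() {
    awk '{ sum += $3 } END { if (NR > 0) print int(2 * sum / NR) }'
}
# e.g. samtools depth aln.bam | twice_mean_depth
.EE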

.SH LIMITATIONS
.PP
.IP o 2
Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c.
.IP o 2
Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan
reads or ends mapped to different chromosomes). If this is a concern,
please use Picard's MarkDuplicates, which correctly handles these cases
although it is a little slower.

.SH AUTHOR
.PP
Heng Li from the Sanger Institute wrote the original C version of samtools.
Bob Handsaker from the Broad Institute implemented the BGZF library.
James Bonfield from the Sanger Institute developed the CRAM implementation.
John Marshall and Petr Danecek contribute to the source code and various
people from the 1000 Genomes Project have contributed to the SAM format
specification.

.SH SEE ALSO
.IR bcftools (1),
.IR sam (5),
.IR tabix (1)
.PP
Samtools website: <http://www.htslib.org/>
.br
File format specification of SAM/BAM, CRAM, VCF/BCF: <http://samtools.github.io/hts-specs>
.br
Samtools latest source: <https://github.com/samtools/samtools>
.br
HTSlib latest source: <https://github.com/samtools/htslib>
.br
Bcftools website: <http://samtools.github.io/bcftools>