% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/DistToNearest.R
\name{distToNearest}
\alias{distToNearest}
\title{Distance to nearest neighbor}
\usage{
distToNearest(db, sequenceColumn = "JUNCTION", vCallColumn = "V_CALL",
  jCallColumn = "J_CALL", model = c("ham", "aa", "hh_s1f", "hh_s5f",
  "mk_rs1nf", "mk_rs5nf", "m1n_compat", "hs1f_compat"),
  normalize = c("len", "none"), symmetry = c("avg", "min"),
  first = TRUE, nproc = 1, fields = NULL, cross = NULL,
  mst = FALSE, subsample = NULL, progress = FALSE)
}
\arguments{
\item{db}{data.frame containing sequence data.}

\item{sequenceColumn}{name of the column containing nucleotide sequences to compare. 
Also used to determine sequence length for grouping.}

\item{vCallColumn}{name of the column containing the V-segment allele calls.}

\item{jCallColumn}{name of the column containing the J-segment allele calls.}

\item{model}{underlying SHM model, which must be one of 
\code{c("ham", "aa", "hh_s1f", "hh_s5f", "mk_rs1nf", "mk_rs5nf", "m1n_compat", "hs1f_compat")}.
See Details for further information.}

\item{normalize}{method of normalization. The default is \code{"len"}, which 
divides the distance by the length of the sequence group. If 
\code{"none"}, no normalization is performed.}

\item{symmetry}{if \code{model} is \code{"hh_s5f"}, the distance between seq1 and seq2 is either 
the average (\code{"avg"}) of seq1->seq2 and seq2->seq1 or the minimum (\code{"min"}).}

\item{first}{if \code{TRUE} only the first call of the gene assignments 
is used. If \code{FALSE} the union of ambiguous gene 
assignments is used to group all sequences with any 
overlapping gene calls.}

\item{nproc}{number of cores to distribute the function over.}

\item{fields}{additional fields to use for grouping.}

\item{cross}{character vector of column names to use for grouping when calculating 
distances across groups; that is, the columns that define self versus others.}

\item{mst}{if \code{TRUE}, return comma-separated branch lengths from minimum 
spanning tree.}

\item{subsample}{number of sequences to subsample for speeding up pairwise-distance-matrix calculation. 
Subsampling is performed without replacement in each group of sequences with the 
same \code{vCallColumn}, \code{jCallColumn}, and junction length. 
If \code{subsample} is larger than the number of unique sequences in a group, 
then the subsampling process is ignored for that group. For each sequence in \code{db},
the reported \code{DIST_NEAREST} is the distance to the closest sequence in the
subsampled set for the group. If \code{NULL} no subsampling is performed.}

\item{progress}{if \code{TRUE} print a progress bar.}
}
\value{
Returns a modified \code{db} data.frame with nearest neighbor distances in the 
          \code{DIST_NEAREST} column if \code{cross=NULL}. 
          If \code{cross} was specified, distances are added as the 
          \code{CROSS_DIST_NEAREST} column.
}
\description{
Get the non-zero distance from every sequence (as defined by \code{sequenceColumn}) to its 
nearest sequence sharing the same V gene, J gene, and sequence length.
}
\details{
The distance to nearest neighbor can be used to estimate a threshold for assigning Ig
sequences to clonal groups. A histogram of the resulting vector is often bimodal, 
with the ideal threshold being a value that separates the two modes.

The following distance measures are accepted by the \code{model} parameter.

\itemize{
  \item \code{"ham"}:          Single nucleotide Hamming distance matrix from \link[alakazam]{getDNAMatrix} 
                               with gaps assigned zero distance.
  \item \code{"aa"}:           Single amino acid Hamming distance matrix from \link[alakazam]{getAAMatrix}.
  \item \code{"hh_s1f"}:       Human single nucleotide distance matrix derived from \link{HH_S1F} with 
                               \link{calcTargetingDistance}.
  \item \code{"hh_s5f"}:       Human 5-mer nucleotide context distance matrix derived from \link{HH_S5F} with 
                               \link{calcTargetingDistance}.
  \item \code{"mk_rs1nf"}:     Mouse single nucleotide distance matrix derived from \link{MK_RS1NF} with 
                               \link{calcTargetingDistance}.
  \item \code{"mk_rs5nf"}:     Mouse 5-mer nucleotide context distance matrix derived from \link{MK_RS5NF} with 
                               \link{calcTargetingDistance}.
  \item \code{"hs1f_compat"}:  Backwards compatible human single nucleotide distance matrix used in 
                               SHazaM v0.1.4 and Change-O v0.3.3.
  \item \code{"m1n_compat"}:   Backwards compatible mouse single nucleotide distance matrix used in 
                               SHazaM v0.1.4 and Change-O v0.3.3.
}

Note on \code{NA}s: if, for a given combination of V gene, J gene, and sequence length,
there is only 1 sequence (as defined by \code{sequenceColumn}), \code{NA} is returned 
instead of a distance (since it has no neighbor). If, for a given combination, there are 
multiple sequences but only 1 unique sequence, every sequence in the group is the de facto 
nearest neighbor of every other, which would give distances of 0; in this case 
\code{NA}s are returned instead of zero distances.

Note on \code{subsample}: Subsampling is performed independently in each group of sequences
sharing the same \code{vCallColumn}, \code{jCallColumn}, and junction length. If \code{subsample} 
is larger than the number of sequences in the group, it is ignored. In other words, subsampling 
is performed only on groups of sequences of size equal to or greater than \code{subsample}. 
\code{DIST_NEAREST} has values calculated using all sequences in the group for groups of size
smaller than \code{subsample} and values calculated using a subset of sequences for the larger 
groups. To select a value of \code{subsample}, it can be useful to explore the group sizes in 
\code{db}.
}
\examples{
# Subset example data to one sample as a demo
data(ExampleDb, package="alakazam")
db <- subset(ExampleDb, SAMPLE == "-1h")
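
# Sketch only, not part of the original example: tabulate group sizes to help
# choose a value for subsample. Grouping by the first V gene, first J gene,
# and junction length is an assumption that approximates the grouping applied
# inside distToNearest; the variable names below are illustrative.
v_gene <- alakazam::getGene(db$V_CALL_GENOTYPED, first=TRUE)
j_gene <- alakazam::getGene(db$J_CALL, first=TRUE)
group_sizes <- table(paste(v_gene, j_gene, nchar(db$JUNCTION), sep="_"))
summary(as.vector(group_sizes))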

# Use genotyped V assignments, Hamming distance, and normalize by junction length
dist <- distToNearest(db, vCallColumn="V_CALL_GENOTYPED", model="ham", 
                      first=FALSE, normalize="len")
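
# Sketch only: count the sequences for which no neighbor distance is reported.
# As described in Details, NA is expected when a sequence is alone in its
# group or when every sequence in the group is identical.
sum(is.na(dist$DIST_NEAREST))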
                           
# Plot histogram of non-NA distances (requires ggplot2)
library(ggplot2)
p1 <- ggplot(data=subset(dist, !is.na(DIST_NEAREST))) + 
      theme_bw() + 
      ggtitle("Distance to nearest: Hamming") + 
      xlab("distance") +
      geom_histogram(aes(x=DIST_NEAREST), binwidth=0.025, 
                     fill="steelblue", color="white")
plot(p1)
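
# Sketch only: mark a candidate clonal threshold on the histogram. The value
# 0.12 is an assumption chosen for illustration, not a recommended setting;
# pick a value that separates the two modes seen in your own data.
threshold <- 0.12
p2 <- p1 + geom_vline(xintercept=threshold, color="firebrick", linetype=2)
plot(p2)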

}
\references{
\enumerate{
  \item  Smith DS, et al. Di- and trinucleotide target preferences of somatic 
           mutagenesis in normal and autoreactive B cells. 
           J Immunol. 1996 156:2642-52. 
  \item  Glanville J, Kuo TC, von Budingen H-C, et al. 
           Naive antibody gene-segment frequencies are heritable and unaltered by 
           chronic lymphocyte ablation. 
           Proc Natl Acad Sci USA. 2011 108(50):20066-71.
  \item  Yaari G, et al. Models of somatic hypermutation targeting and substitution based 
           on synonymous mutations from high-throughput immunoglobulin sequencing data. 
           Front Immunol. 2013 4:358.
 }
}
\seealso{
See \link{calcTargetingDistance} for generating nucleotide distance matrices 
          from a \link{TargetingModel} object. See \link{HH_S5F}, \link{HH_S1F}, 
          \link{MK_RS1NF}, \link[alakazam]{getDNAMatrix}, and \link[alakazam]{getAAMatrix}
          for individual model details.
}
