\documentclass{article} \parskip 6pt \usepackage[margin=1.25in]{geometry} \usepackage[colorlinks=true,urlcolor=blue]{hyperref} %\VignetteIndexEntry{Multivariate Methods} %\VignetteDepends{smwrQW} \begin{document} \SweaveOpts{concordance=TRUE} \raggedright \title{Multivariate Methods} \author{Dave Lorenz} \maketitle \begin{abstract} These examples demonstrate some of the functions that can be used to prepared water-quality data for multivariate analysis that are available in the \texttt{smwrQW} package. \end{abstract} \tableofcontents \eject \section{Introduction} The examples in this vignette use the DDT in fish dataset (table 13.1 in Helsel 2012). The examples in this vignette use the function \texttt{as.lcens} to convert those data to a form used by the functions demonstrated; the class ''lcens'' is most appropriate for these data as they are only left-censored and have only the value and an indicator of censoring. The functions demonstrated in these examples will also accept data of class ''qw.'' The R code following this paragraph creates the data and recodes the water-quality data to class ''lcens.'' There are many contributed packages that can be used for multivariate analysis; the examples in this vignette use only packages supplied with base R and are intended only to demonstrate the multivariate functionality of the \texttt{smwrQW} package. <>= # Load the smwrQW package library(smwrQW) # Create the dataset omitting Age, code as character DDT <- data.frame(Site=1:32, opDDD=c("<5","<5","5.3","<5","<5","<5","<5","<5", "<5","5.1","<5","<5","<5","5.1","<5","<5", "9","<5","9.8","<5","<5","5.1","8","<5", "<5","<5","5.7","<5","<5","<5","<5","<5"), ppDDD=c("<5","42","38","12","<5","<5","14","15", "12","39","5.7","9.4","18","27","10","7.6", "46","22","41","13","26","24","100","<5", "<5","<5","27","15","20","22","31","15"), opDDE=c("<5","8.4","<5","<5","<5","<5","<5","<5", "<5","<5","<5","<5","<5","<5","<5","<5", "<5","<5","6.9","<5","<5","250","8","<5", "<5","<5","<5","6","7.5","<5","<5","<5"), ppDDE=c("14","130","250","57","16","<5","52","48", "110","100","87","53","210","140","24","15", "110","51","50","66","110","38","160","23", "17","16","140","8.1","22","190","42","23"), opDDT=c("<5","<5","<5","<5","<5","<5","<5","<5", "<5","<5","18","<5","<5","<5","<5","<5", "<5","<5","<5","<5","<5","<5","<5","<5", "<5","<5","<5","<5","<5","5.2","<5","<5"), ppDDT=c("<5","31","11","<5","<5","<5","14","<5", "20","24","<5","7.3","30","33","5.8","10", "21","7.4","11","8.6","26","11","<5","<5", "<5","<5","14","<5","6.4","27","25","5.6"), stringsAsFactors=FALSE) # Convert concentrations to class "lcens" DDT <- transform(DDT, opDDD=as.lcens(opDDD), ppDDD=as.lcens(ppDDD), opDDE=as.lcens(opDDE), ppDDE=as.lcens(ppDDE), opDDT=as.lcens(opDDT), ppDDT=as.lcens(ppDDT)) # Load the libraries library(cluster) @ \eject \section{Binary Methods} The most conceptually simple method to prepare censored data for multivariate analysis recodes values as presence/absence or as the numerical values 1 and 0, respectively. The values are typically recoded based on the largest reporting limit, which works well for moderately and highly censored data. The recoding criterion can be set to any value greater than the largest reporting level, which can be useful for un- or lightly censored data. The recoded presence/absence, often referred to as 0/1 (reversing the sense of presence/absence), data usually converted into similarity or dissimilarity metrics. Helsel (2012) points out that many of the techniques for computing the similarity or dissimilarity metrics were developed for the biological sciences and are not appropriate for water-quality data because they downweight absence/absence pairs. The simple matching coefficient for computing dissimilarities described by Helsel (2012) can be computed using the \texttt{daisy} function in the \texttt{cluster} package and is illustrated in the R code following this paragraph. The distances between the first 10 sites are shown in the last line of R code, for illustration only. For example the 0.5 dissimilarity between site 1 and 2 indicates that one-half the values are the same (opDDD, ppDDE, and opDDT) and the other one-half are different. <>= # Compute the 0/1 values DDT01 <- with(DDT, code01(opDDD, ppDDD, opDDE, ppDDE, opDDT, ppDDT)) # Compute the dissimilarities using simple matching DDT.diss <- daisy(DDT01, metric="gower", type=list(symm=1:6)) # Show the first 10 rows of the 0/1 data and the distances head(DDT01, 10) print(as.dist(as.matrix(DDT.diss)[1:10,1:10]), digits=4) @ The distance matrix can be used to construct a hierarchical cluster analysis. The \texttt{stats} package in base R has the \texttt{hclust} function and the \texttt{cluster} package, supplied with base R has the \texttt{agnes} function that has more options than \texttt{hclust}. The example following this paragraph uses the \texttt{hclust} function. The details of the cluster merging sequence are printed to help understand the dendrogram. The cluster merging sequence has three columns, the first column is the left cluster number, the second column is the right cluster number and the third column is the height or distance between those cluster. For the left and right columns, negative values refer to the object number and positive values refer to the row number in the matrix. In this example the first cluster is formed from the merger of object 1 and 5 and the second cluster is formed from the merger of object 24 and the fist cluster. In both cases, the distances are 0 (no difference between the values). That pattern can be seen in the dendrogram. <>= # The cluster analysis DDT.hclust <- hclust(DDT.diss, method="average") # Details for the cluster merging sequence cbind(DDT.hclust$merge, DDT.hclust$height) # Plot the dendrogram setSweave("graph01", 6 ,6) # Create the graph, dendGram(DDT.hclust, ytitle="Average Merging Dissimilarity") graphics.off() @ \includegraphics{graph01.pdf} \paragraph{} \textbf{Figure 1.} The dendrogram from the binary method of distance calculation. \eject \section{Ordinal Methods} Gehan's u-score, described in Helsel (2012) is an affine transform of the rank for uncensored and single-reporting level censored values. The equation is u = (r-(n+1)/2)*2 or r = u/2 + (n+1)/2, where u is the u-score, r is the rank, and n is the number of observations. The advantage for the u-score is that is computable for multiple reporting levels and multiply censored data. The examples in this section use the u-score computed by the \texttt{codeU} function. The u-scores can be used directly in a principal component analysis or used to compute distances for a cluster analysis. The code following this paragraph computes the u-scores and the computes the Euclidean distance between sites. <>= # Compute the u scores DDTU <- with(DDT, codeU(opDDD, ppDDD, opDDE, ppDDE, opDDT, ppDDT)) # Compute the dissimilarities using simple matching DDTU.dist <- dist(DDTU) # Show the first 8 rows of the 0/1 data and the distances head(DDTU, 8) print(as.dist(as.matrix(DDTU.dist)[1:8,1:8]), digits=4) @ Hierarchical cluster analysis can be computed using either the \texttt{hclust} or \texttt{agnes} function. The R code following this paragraph uses the \texttt{hclust} function to compute the cluster analysis and uses the "average" method for merging clusters. For this example, the cluster merging sequence is not shown as there are no multiple observations separated by a distance of 0. But sites 5 and 26 have identical u-scores. <>= # The cluster analysis DDTU.hclust <- hclust(DDTU.dist, method="average") # Plot the dendrogram setSweave("graph02", 6 ,6) # Create the graph, dendGram(DDTU.hclust, ytitle="Average Merging Distance", xtitle="Site Number") graphics.off() @ \includegraphics{graph02.pdf} \paragraph{} \textbf{Figure 2.} The dendrogram from the ordinal method of distance calculation. Principal component analysis can be computed using either the \texttt{princomp} or \texttt{prcomp} function. The R code following this paragraph uses the \texttt{princomp} function to compute the analysis, but the \texttt{prcomp} function has more flexibility and numerical stability. <>= # The PCA. DDTU.pca <- princomp(DDTU, cor=TRUE) print(DDTU.pca) print(loadings(DDTU.pca), digits=4, cutoff=0) # Plot the PCA setSweave("graph03", 6 ,6) # Create the graph, AA.pl <- biPlot(DDTU.pca, Scale="distance", range.factor=1.1, obsPlotLabels=list(labels="none")) # Manually assign label direction to reduce ambiguity dir <- c("N","NE","NW","S","N","E","W","NE","E","E","NE","SW","NE","E","S","S", "N","NE","NE","NE","NE","NE","NW","SE","NW","SE","N","NE","NE","NE","NE","NW") with(AA.pl, labelPoints(x, y, labels=1:32, dir=dir, offset=0.5, current=AA.pl)) graphics.off() @ \includegraphics{graph03.pdf} \paragraph{} \textbf{Figure 3.} The biplot of the principal component analyses of the u-sores of the DDT data. The x-axis (Component 1) represents the total concentration, but primarily the pp-type compounds. Sites on the left-hand side have larger concentrations than those on the right-hand side. The y-axis (Component 2) represents whether opDDE or opDDT has the larger concentration. \eject \section{Imputation Methods} For relatively small percentages of left-censored water-quality data, substitute values can be estimated for left-censored values based on the overall structure of the data and the complete data used in multivariate procedures. The methods for estimating sensible substitution strategies for left-censored values are described by Palarea-Albaladejo (2013). The \texttt{zCompositions} package implements those methods. The functions \texttt{imputeLessThans} and \texttt{mImputeLessThans} in the \texttt{smwrQW} package can be used to estimate values for data of class "qw" or "lcens." The example in this section uses an altered version of the DDT dataset. Estimated values for the left-censored were synthesized to approximately maintain the correlations structure of the u-sores and the data were recensored at 2. However, the intent of the altered values was to illustrate the process of estimating values rather than produce a similar analysis. The R code immediately following this paragraph creates those data. <>= # Create the altered dataset, code as character DDTalt <- data.frame(Site=1:32, opDDD=c("<2","4","5.3","2","<2","<2","2.3","2.3", "2.1","5.1","<2","<2","2.8","5.1","<2","<2", "9","2.8","9.8","2.1","3.3","5.1","8","<2", "<2","<2","5.7","2.1","2.6","3","3.5","2.2"), ppDDD=c("3.1","42","38","12","3.1","2.4","14","15", "12","39","5.7","9.4","18","27","10","7.6", "46","22","41","13","26","24","100","3.2", "3.1","3.1","27","15","20","22","31","15"), opDDE=c("2.1","8.4","<2","<2","<2","4.9","<2","<2", "<2","<2","<2","<2","<2","<2","<2","<2", "<2","<2","6.9","<2","<2","250","8","<2", "<2","<2","<2","6","7.5","<2","<2","<2"), ppDDE=c("14","130","250","57","16","3.4","52","48", "110","100","87","53","210","140","24","15", "110","51","50","66","110","38","160","23", "17","16","140","8.1","22","190","42","23"), opDDT=c("<2","3.9","2.4","2","<2","<2","2.4","2", "2.5","2.6","18","2.2","2.6","2.7","2.1","2.2", "4.9","2.3","2.4","2.3","2.6","2.4","2.3","<2", "<2","<2","2.5","2","2.2","5.2","2.6","2.2"), ppDDT=c("2.2","31","11","3.2","2.3","<2","14","3.2", "20","24","3","7.3","30","33","5.8","10", "21","7.4","11","8.6","26","11","3.9","2.5", "2.3","2.3","14","2.6","6.4","27","25","5.6"), stringsAsFactors=FALSE) # Convert concentrations to class "lcens" DDTalt <- transform(DDTalt, opDDD=as.lcens(opDDD), ppDDD=as.lcens(ppDDD), opDDE=as.lcens(opDDE), ppDDE=as.lcens(ppDDE), opDDT=as.lcens(opDDT), ppDDT=as.lcens(ppDDT)) @ The \texttt{imputeLessThans} function can estimate sensible substitution values for left-censored values. The R code immediately following this paragraph uses that function to estimate complete data for the DDTal dataset. Those data are used in a principal component analysis, \texttt{prcomp} and the biplot is created. <>= # Impute the less-than values, multRepl required because opDDE is heavily censored DDTaltImp <- with(DDTalt, imputeLessThans(opDDD, ppDDD, opDDE, ppDDE, opDDT, ppDDT, initial="multRepl")) # Print alternate and imputed values head(DDTalt) head(DDTaltImp) # The PCA. DDTaltImp.pca <- prcomp(log(DDTaltImp), center=TRUE, scale=TRUE) print(DDTaltImp.pca) # Plot the PCA setSweave("graph04", 6 ,6) # Create the graph, biPlot(DDTaltImp.pca, Scale="distance", range.factor=1.1) graphics.off() @ \includegraphics{graph04.pdf} \paragraph{} \textbf{Figure 4.} The biplot of the principal component analyses of the of the DDTalt data. The x-axis (Component 1) represents the total concentration. Sites on the right-hand side have larger concentrations than those on the left-hand side. The y-axis (Component 2) represents the concentration of opDDE. Sites with negative scores have larger concentrations than those with positive scores. \begin{thebibliography}{9} \bibitem{H12} Helsel, D.R. 2012, Statistics for Censored Environmental Data Using Minitab and R: New York, Wiley, 324 p. \bibitem{HH} Helsel, D.R., and Hirsch, R.M., 2002, Statistical methods in water resources: U.S. Geological Survey Techniques of Water-Resources Investigations, book 4, chap. A3, 522 p. \bibitem{DL} Lorenz, D.L., 2015, smwrStats--An R package for the analysis of hydrologic data, Version 0.7.3: U.S. Geological Survey Open File Report, ? p. \bibitem{PA} Palarea-Albaladejo J. and Martin-Fernandez J.A., 2013, Values below detection limit in compositional chemical data: Analytica Chimica Acta 2013, v. 764, p. 32-43. accessed July 6, 2015 at https://doi.org/10.1016/j.aca.2012.12.029. \end{thebibliography} \end{document}