\documentclass{article} \parskip 6pt \usepackage[margin=1.25in]{geometry} \usepackage[colorlinks=true,urlcolor=blue]{hyperref} %\VignetteIndexEntry{Computing Summary Statistics for Daily Data} %\VignetteDepends{smwrBase} \begin{document} \SweaveOpts{concordance=TRUE} \raggedright \title{Computing Summary Statistics for Daily Data} \author{Dave Lorenz} \maketitle \begin{abstract} These examples demonstrate how to compute selected summary statistics for daily streamflow data. The examples can easily be extended to other statistics or data types. \end{abstract} \tableofcontents \eject \section{Introduction} These examples use data from the smwrData package. The data are retrieved in the following code. <>= # Load the smwrBase and smwrData packages library(smwrBase) library(smwrData) # Retrieve streamflow data for the Choptank River near Greensboro, Maryland data(ChoptankFlow) # Print the first and last few rows of the data head(ChoptankFlow) tail(ChoptankFlow) # Check for missing values with(ChoptankFlow, screenData(datetime, Flow, year = "calendar")) @ \eject \section{Computing Daily Mean Values} The simplest and most straightforward way to compute summary statistics from arbitrarily grouped data is to use the \texttt{tapply} function. At its simplest, it requires only three arguments---\texttt{X}, the data to summarize; \texttt{INDEX}, the grouping data; and \texttt{FUN}, the summary statistic function. The \texttt{smwrBase} package contains the \texttt{baseDay} function that can be used to group data by day, so that all data for each day, including February 29, can be summarized. The output can be arranged so that the sequence represents the calendar year, water year, or climate year, beginning January 1, October 1, or April 1, respectively. The following script demonstrates how to use the \texttt{tapply} and \texttt{baseDay} functions to compute the daily mean streamflow for the previously retrieved data. It uses the \texttt{with} function to facilitate referring to columns in the dataset. <>= # There are no missing values, so only need the basic # 3 arguments for tapply ChoptankFlow.daily <- with(ChoptankFlow, tapply(Flow, baseDay(datetime, numeric=FALSE, year="calendar"), mean)) # Print the first and last few values of the output head(ChoptankFlow.daily) tail(ChoptankFlow.daily) @ The output from \texttt{tapply} is an array. Because the output from the \texttt{baseDay} function is an array of one dimension, it is printed in the form of a named vector. Different summary statistic functions will produce different outputs; for example, if the summary function had been \texttt{quantile}, the output would have been a list. The \texttt{tapply} function is very powerful and easy to use; however, there are times when we want the output in the form of a dataset rather than a vector or array. In those cases, the \texttt{aggregate} function is a better alternative than the \texttt{tapply} function. The \texttt{aggregate} function has several usage options. The script below demonstrates how to build a formula to compute the same statistics that we computed in the previous script. Early versions of \texttt{aggregate} required the output of the summary statistic function to be a scalar, but that is no longer a limitation. <>= # There are no missing values ChoptankFlow.dailyDF <- aggregate(Flow ~ baseDay(datetime, numeric=FALSE, year="calendar"), data=ChoptankFlow, FUN=mean) # Print the first and last few values of the output head(ChoptankFlow.dailyDF) tail(ChoptankFlow.dailyDF) # Rename the grouping column names(ChoptankFlow.dailyDF)[1] <- "Day" @ Note that the grouping column, renamed Day in the last line of code, is a factor. If character data are needed, executing the expression:\newline \texttt{ChoptankFlow.dailyDF\$Day <- as.character(ChoptankFlow.dailyDF\$Day)}\newline will convert the column Day to character. \eject \section{Computing Annual Mean Values} The previous example can easily be expanded to any grouping. This example computes annual means by calendar year. The \texttt{year} function in \texttt{lubridate} is used to group the data by calendar year, and the grouping column is renamed CalYear. The \texttt{waterYear} function in \texttt{smwrBase} can be used to group the data by water year. <>= # There are no missing values ChoptankFlow.yrDF <- aggregate(Flow ~ year(datetime), data=ChoptankFlow, FUN=mean) # Rename the grouping column names(ChoptankFlow.yrDF)[1] <- "CalYear" # Print the first few values of the output head(ChoptankFlow.yrDF) @ Other grouping functions include \texttt{month} (month) in \texttt{lubridate}, \texttt{seasons} (user-defined seasons) in \texttt{smwrBase}. Refer to the documentation for each of these functions for a description of the arguments. \eject \section{Computing Yearly and Monthly Mean Values} Aggregation can also be done by multiple grouping variables. This example computes the mean streamflow for each month by year. This example uses the \texttt{year} and the \texttt{month} functions because the output is sorted by groups. The sequence of the groups in the call is important---the sorting is done in the order specified in the formula. For this example, the data are sorted by month and then by year, which in this case, keeps the order correct; grouping by water year would misplace October, November, and December. For a calendar year table, the months are in the correct order. <>= # There are no missing values ChoptankFlow.my <- aggregate(Flow ~ month(datetime, label=TRUE) + year(datetime), data=ChoptankFlow, FUN=mean) # Rename columns 1 and 2 names(ChoptankFlow.my)[1:2] <- c("Month", "Year") # Print the first few values of the output head(ChoptankFlow.my) @ The output dataset may be used as is, or it could be restructured to a table of monthly values for each calendar year. To create a table by water year, the levels in the column Month must be reordered to begin in October and end in September. <>= # Restructure the dataset ChoptankFlow.myTbl <- group2row(ChoptankFlow.my, "Year", "Month", "Flow") # Print the first few values of the output, set width for Vignette options(width=70) head(ChoptankFlow.myTbl) @ Note that this example used the \texttt{group2row} function in \texttt{smwrBase}. The \texttt{reshape} function in \texttt{stats} and \texttt{stack} and \texttt{unstack} functions in \texttt{utils} are other functions that will restructure data. \end{document}