Statistics - Predict 401DL Data Analysis Project Assignment
Question # 00219863
Updated on: 03/11/2016 02:17 PM Due on: 04/10/2016
Predict 401DL Data Analysis Project Assignment #2
Project Assignment #2 (100 points due the end of Session 10)
Overview
This assignment will investigate the extent to which physical measurements can be used to
devise a useful decision rule for abalone harvesting. Use your saved mydata file from the first
assignment for this assignment. Analysis of variance, linear regression and other forms of
analysis will be performed. Snippets of code are supplied for each of the steps in the assignment.
This report should comply with the report template, and be a separate report in its own right.
Results from the first assignment may be referenced as needed.
Report Template
Assignment #
(Enter your name)
Introduction:
The introduction should describe the purpose of the assignment. State the primary objective. It
should be clear to me that you understand the rationale for the assignment.
Results:
Your discussion should be intertwined with the results of your analysis appearing on or near the
page containing analytical results and data displays. Reports should be written to inform a
business reader, not overwhelm the reader with distracting detail. Present appropriate graphics
and tables. All graphics and tables should be identified by number and title for reference
purposes. Do not show unnecessary R output or R code. You should never include a printout of
the data set or extensive tables of summary statistics in the body of the report. Use the appendix
for this information. While graphical output may take up space, such output can be reduced in
size provided some important feature is not lost or obscured. Take a look at the self-check page
posted in the data analysis module. This page shows what the displays should look like.
Conclusions:
Summarize your results in an integrated and cohesive manner. State your conclusions succinctly
and clearly. Responses to the questions placed at the end of the assignment should appear here.
Submission:
The report should be submitted in pdf format. R code should not appear in the body of the
report. Attach your R code as an appendix. This should be all the code used for your report.
Use your saved mydata file from the first data analysis.
1) Write a function in R to calculate the Pearson chi square statistic on 2x2 contingency tables
which have the marginal totals. Use this function to test for independence of SHUCK and
VOLUME. Show the chi square value and p-value and discuss the results. What conclusion
results from rejection of the null hypothesis? What does this indicate about the relationship
between SHUCK and VOLUME?
a) An example of code to include in the function is shown below. The function would start
with a table that has marginal totals. The following statements calculate the expected
value for each cell in the table. Using these statements, the Pearson Chi Square Statistic
value may be calculated within the function and the value returned.
# Use with 2x2 contingency tables that have margins added.
e11 <- x[3,1]*x[1,3]/x[3,3]
e12 <- x[3,2]*x[1,3]/x[3,3]
e21 <- x[3,1]*x[2,3]/x[3,3]
e22 <- x[3,2]*x[2,3]/x[3,3]
b) To dichotomize SHUCK and VOLUME use statements similar to this:
shuck <- factor(mydata$SHUCK > median(mydata$SHUCK),
                labels = c("below", "above"))
c) To generate a table use shuck_volume <- addmargins(table(shuck,volume)). This would
generate a table which could be submitted to the user-supplied chi-square function.
d) Use pchisq() to compute a p-value based on the computed quantile q from your function.
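As one possible way to assemble steps a) through d), the sketch below wraps the expected-value statements in a function and returns the Pearson statistic. The function name chi_2x2 and the cell counts are hypothetical placeholders, not course data; with the real data the table would come from addmargins(table(shuck, volume)).

```r
# Pearson chi-square statistic for a 2x2 table that already has margins
# (row 3 and column 3 hold the totals, as produced by addmargins()).
chi_2x2 <- function(x) {
  e11 <- x[3, 1] * x[1, 3] / x[3, 3]
  e12 <- x[3, 2] * x[1, 3] / x[3, 3]
  e21 <- x[3, 1] * x[2, 3] / x[3, 3]
  e22 <- x[3, 2] * x[2, 3] / x[3, 3]
  (x[1, 1] - e11)^2 / e11 + (x[1, 2] - e12)^2 / e12 +
    (x[2, 1] - e21)^2 / e21 + (x[2, 2] - e22)^2 / e22
}

# Hypothetical counts standing in for table(shuck, volume); not course data.
counts <- matrix(c(40, 10, 12, 38), nrow = 2, byrow = TRUE,
                 dimnames = list(shuck  = c("below", "above"),
                                 volume = c("below", "above")))
shuck_volume <- addmargins(as.table(counts))

q <- chi_2x2(shuck_volume)
# A 2x2 table has 1 degree of freedom; the p-value is the upper-tail area.
p_value <- pchisq(q, df = 1, lower.tail = FALSE)
```

The result can be checked against chisq.test(counts, correct = FALSE), which computes the same statistic without the continuity correction.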
2) Perform an analysis of variance with aov() on SHUCK using CLASS and SEX as the
independent variables. Assume equal variances. Perform two analyses. First use the model
with an interaction term CLASS*SEX and then a model without CLASS*SEX. Use
summary() to obtain the resulting analysis of variance table. Follow up with the
TukeyHSD() function using the analysis of variance model without CLASS*SEX. Interpret
the results. (TukeyHSD() will adjust for unequal sample sizes.) Comment on the results. To
what extent do these results suggest male and female abalones can be combined into a single
category labeled as “adults”?
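A minimal sketch of the two aov() fits and the TukeyHSD() follow-up is given below. It runs on a synthetic data frame named demo (a hypothetical stand-in for mydata, with only three CLASS levels assumed for illustration), so the numbers it prints say nothing about the real abalone data.

```r
set.seed(401)  # synthetic stand-in for mydata; values are illustrative only
n <- 240
demo <- data.frame(
  CLASS = factor(sample(c("A1", "A2", "A3"), n, replace = TRUE)),
  SEX   = factor(sample(c("I", "F", "M"), n, replace = TRUE))
)
demo$SHUCK <- 20 + 10 * as.integer(demo$CLASS) +
  5 * (demo$SEX != "I") + rnorm(n, sd = 4)

# CLASS * SEX expands to CLASS + SEX + CLASS:SEX (main effects plus interaction).
aov_int  <- aov(SHUCK ~ CLASS * SEX, data = demo)
# Main effects only.
aov_main <- aov(SHUCK ~ CLASS + SEX, data = demo)

summary(aov_int)
summary(aov_main)

# Pairwise comparisons from the model without the interaction term.
tukey <- TukeyHSD(aov_main)
tukey$SEX  # differences between SEX levels with adjusted p-values
```

In the SEX component, the row labeled M-F is the one that bears on whether males and females can be pooled as adults.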
3) Use ggplot2 to form a scatterplot of SHUCK versus VOLUME and a scatterplot of their
logarithms, labeling the former as L_SHUCK and the latter as L_VOLUME. Use color to
differentiate CLASS in the plots. Compare the two scatterplots. Where do the various
CLASS levels appear in the plots? What are the implications of the observed patterns
regarding harvesting of abalones?
a) ggplot2 must be installed from CRAN. Use library(ggplot2) prior to executing code.
b) Here is an example of what should be produced using ggplot.
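The following is a rough sketch of code that could produce the two scatterplots. It assumes a synthetic data frame named demo in place of mydata and assumes base-10 logarithms; the assignment does not state which base to use. The plotting part only runs if ggplot2 is installed.

```r
set.seed(401)  # synthetic stand-in for mydata; values are illustrative only
n <- 200
demo <- data.frame(
  CLASS = factor(sample(c("A1", "A2", "A3"), n, replace = TRUE))
)
demo$VOLUME <- exp(rnorm(n, mean = 4 + 0.5 * as.integer(demo$CLASS), sd = 0.4))
demo$SHUCK  <- 0.1 * demo$VOLUME * exp(rnorm(n, sd = 0.2))

# Log-transformed variables (base 10 is an assumption).
demo$L_SHUCK  <- log10(demo$SHUCK)
demo$L_VOLUME <- log10(demo$VOLUME)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Raw scale: SHUCK versus VOLUME, colored by CLASS.
  p_raw <- ggplot(demo, aes(x = VOLUME, y = SHUCK, color = CLASS)) +
    geom_point()
  # Log scale: L_SHUCK versus L_VOLUME, colored by CLASS.
  p_log <- ggplot(demo, aes(x = L_VOLUME, y = L_SHUCK, color = CLASS)) +
    geom_point()
}
```

Calling print(p_raw) and print(p_log) displays the plots.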
4) Regress L_SHUCK as the dependent variable on L_VOLUME, CLASS and SEX. Follow
the steps shown in Section 16.1 of Lander (located in library reserves on the course site), or
alternatively in the Data Analysis Video #2. Use the following multiple regression model:
L_SHUCK~L_VOLUME+CLASS+SEX. Apply summary() to the regression object to
display significance test results. Discuss your conclusions and respond to the questions:
a) In this situation, what advantage is there in using log transformed values for regression?
b) Is SEX an important predictor in this regression model?
c) What inferences can be drawn based on the coefficient estimates for CLASS levels?
(Hint: This question is not asking if the estimates are statistically significant. It is asking
for an interpretation of any pattern in these coefficients.)
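For orientation, a minimal sketch of the regression call follows. The data frame demo is synthetic and hypothetical (three CLASS levels assumed), so the fitted coefficients are illustrative only; the model formula is the one given above.

```r
set.seed(401)  # synthetic stand-in for mydata; values are illustrative only
n <- 300
demo <- data.frame(
  CLASS = factor(sample(c("A1", "A2", "A3"), n, replace = TRUE)),
  SEX   = factor(sample(c("I", "F", "M"), n, replace = TRUE))
)
demo$L_VOLUME <- rnorm(n, mean = 2 + 0.2 * as.integer(demo$CLASS), sd = 0.3)
demo$L_SHUCK  <- -0.8 + demo$L_VOLUME -
  0.03 * as.integer(demo$CLASS) + rnorm(n, sd = 0.05)

# Multiple regression with the formula from the assignment.
out <- lm(L_SHUCK ~ L_VOLUME + CLASS + SEX, data = demo)
summary(out)

# CLASS coefficients are offsets relative to the baseline level (A1 here).
class_coefs <- coef(out)[grep("^CLASS", names(coef(out)))]
class_coefs
```

Because CLASS and SEX are factors, lm() reports one coefficient per non-baseline level, which is what makes a pattern across CLASS levels visible.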
5) Perform an analysis of the residuals. If “out” is the regression object, use out$residuals and
construct a histogram and QQ plot. Compute the skewness and kurtosis. For a good fit, the
histogram of residuals should approximate a normal distribution. Describe the distribution of
residuals. What is revealed by the QQ plot of the residuals? Using ggplot, plot the residuals
versus L_VOLUME coloring the data points by CLASS, and a second time coloring the data
points by SEX. Use ggplot to present boxplots of the residuals by SEX and by CLASS.
What do the boxplots of the residuals reveal? Does the regression model fit the data?
a) The package “moments” will need to be installed on the computer.
b) Here is the type of code needed: ggplot(out, aes(x = L_VOLUME,y = out$residuals)) +
geom_point(aes(color = CLASS)) + labs(x = "L_VOLUME", y = "Residual")
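The skewness and kurtosis of the residuals can also be computed directly from central moments, as in the sketch below. The helper names skew_fn and kurt_fn and the toy regression are hypothetical; the formulas are the moment-based definitions (kurtosis near 3 for a normal distribution), which should agree with what the moments package reports.

```r
# Moment-based skewness: third central moment over variance^(3/2).
skew_fn <- function(r) {
  d <- r - mean(r)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based kurtosis (not excess kurtosis): about 3 for normal data.
kurt_fn <- function(r) {
  d <- r - mean(r)
  mean(d^4) / mean(d^2)^2
}

set.seed(401)  # toy regression standing in for the real model; illustrative only
x <- rnorm(300, mean = 2, sd = 0.3)
y <- -0.8 + x + rnorm(300, sd = 0.05)
out <- lm(y ~ x)

res_skew <- skew_fn(out$residuals)
res_kurt <- kurt_fn(out$residuals)
```

hist(out$residuals) and qqnorm(out$residuals) followed by qqline(out$residuals) give the histogram and QQ plot for the same residuals.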
6) There is a tradeoff faced in managing the abalone harvest. The infant population must be
protected since that represents future harvests. On the other hand, the harvest should be
designed to be efficient with a sufficient yield to justify the effort. VOLUME will be used
for a very simple decision rule. If an abalone VOLUME is below a specified “cutoff” (i.e.
specified volume), that individual will not be harvested. If above, it will be harvested.
a) Calculate the proportion of infant abalones which fall beneath a specified volume or
“cutoff”. A series of volumes covering the range from minimum to maximum abalone
volume will be used in a "for loop" to determine how the proportion of infants not
harvested changes as the “cutoff” changes. Example code for doing this is supplied below.
idxi <- mydata[,1]=="I"
idxf <- mydata[,1]=="F"
idxm <- mydata[,1]=="M"
max.v <- max(mydata$VOLUME)
min.v <- min(mydata$VOLUME)
delta <- (max.v - min.v)/100
prop.infants <- numeric(0)
volume.value <- numeric(0)
total <- length(mydata[idxi,1]) # This value must be changed for adults.
for (k in 1:100)
{
value <- min.v + k*delta
volume.value[k] <- value
prop.infants[k] <- sum(mydata$VOLUME[idxi] <= value)/total
}
# prop.infants shows the impact of increasing the volume cutoff for harvesting.
# The following code shows how to "split" the population at a 50% harvest of infants.
n.infants <- sum(prop.infants <= 0.5)
split.infants <- min.v + (n.infants + 0.5)*delta # This estimates the desired volume.
plot(volume.value, prop.infants, col = "green", type = "l", lwd = 2,
     main = "Proportion of Infants Not Harvested")
abline(h=0.5)
abline(v = split.infants)
b) Modify the code. This time instead of counting infants, count adults. Present a plot
showing the adult proportions versus volume. Compute the 50% "split" volume value for
the adults and show on the plot similarly to the plot for infants.
It is essential that the males and females be combined into a single count as "adults" for
computing the proportion for "adults". Part #8) will require plotting of infants versus
adults. For this plotting to be accomplished, a "for loop", similar to the one above, may be
used to compute the adult harvest proportions. It must use the same value for the
constants min.v and delta. It must also use the statement “for (k in 1:100)”. Otherwise, the
resulting adult proportions cannot be directly compared to the infant proportions.
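One way the adult version of the loop might look is sketched below. The data frame demo is a small deterministic stand-in for mydata (hypothetical values), and idxa is a hypothetical name for the pooled female and male index; min.v, delta and the 1:100 loop are kept exactly as in the infant code.

```r
# Deterministic stand-in for mydata; values are illustrative only.
demo <- data.frame(
  SEX    = rep(c("I", "F", "M"), times = c(60, 70, 70)),
  VOLUME = c(seq(10, 200, length.out = 60),
             seq(80, 480, length.out = 70),
             seq(100, 500, length.out = 70))
)

idxa  <- demo$SEX %in% c("F", "M")   # females and males pooled as adults
max.v <- max(demo$VOLUME)
min.v <- min(demo$VOLUME)
delta <- (max.v - min.v) / 100       # same grid as for the infants

prop.adults  <- numeric(100)
volume.value <- numeric(100)
total <- sum(idxa)                   # denominator is now the adult count

for (k in 1:100) {
  value <- min.v + k * delta
  volume.value[k] <- value
  prop.adults[k] <- sum(demo$VOLUME[idxa] <= value) / total
}

# Estimated volume at which half of the adults fall below the cutoff.
n.adults     <- sum(prop.adults <= 0.5)
split.adults <- min.v + (n.adults + 0.5) * delta
```

The plot follows the infant example: plot(volume.value, prop.adults, type = "l") with abline(h = 0.5) and abline(v = split.adults).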
7) This part will address the determination of a volume.value that corresponds to the observed
maximum difference in harvest percentages of adults and infants. To calculate this result, the
proportions from #6) must be used. These proportions must be converted from "not
harvested" proportions to "harvested" proportions by using (1-prop.infants) for infants, and
(1-prop.adults) for adults. (The proportion for infants drops sooner than the proportion for adults
because infants are maturing and becoming adults with larger volumes.) From the plot generated,
determine the volume.value which corresponds to the maximum difference in “harvested”
proportions. Report it and the associated difference in proportions. (For brevity, we will
ignore the evident variability present in the peak of the plot you generate and pick a single
number. Curve smoothing algorithms go beyond the scope of this assignment.)
a) Present a plot of the difference ((1-prop.adults) - (1-prop.infants)) versus volume.value.
Use volume.value from #6).
b) Determine the volume.value which corresponds to the observed “peak” difference.
c) What harvest proportions for infants and adults would result if this volume.value is used
as a “cutoff” for decision making? Should other cutoffs be considered?
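The peak difference can be located with which.max(), as in this sketch. It reuses the loop structure from #6) on a deterministic stand-in data frame demo (hypothetical values); diff.harvest and k.peak are hypothetical names.

```r
# Deterministic stand-in for mydata; values are illustrative only.
demo <- data.frame(
  SEX    = rep(c("I", "F", "M"), times = c(60, 70, 70)),
  VOLUME = c(seq(10, 200, length.out = 60),
             seq(80, 480, length.out = 70),
             seq(100, 500, length.out = 70))
)
idxi  <- demo$SEX == "I"
idxa  <- demo$SEX %in% c("F", "M")
min.v <- min(demo$VOLUME)
delta <- (max(demo$VOLUME) - min.v) / 100

prop.infants <- numeric(100)
prop.adults  <- numeric(100)
volume.value <- numeric(100)
for (k in 1:100) {
  value <- min.v + k * delta
  volume.value[k] <- value
  prop.infants[k] <- sum(demo$VOLUME[idxi] <= value) / sum(idxi)
  prop.adults[k]  <- sum(demo$VOLUME[idxa] <= value) / sum(idxa)
}

# Difference in harvested proportions, adults minus infants.
diff.harvest <- (1 - prop.adults) - (1 - prop.infants)

k.peak <- which.max(diff.harvest)  # first grid index attaining the maximum
peak <- c(cutoff     = volume.value[k.peak],
          difference = diff.harvest[k.peak],
          infants    = 1 - prop.infants[k.peak],
          adults     = 1 - prop.adults[k.peak])
peak
```

plot(volume.value, diff.harvest, type = "l") with abline(v = volume.value[k.peak]) shows the curve and the selected cutoff.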
8) Construct an ROC curve by plotting (1-prop.adults) versus (1-prop.infants). Each point
which appears corresponds to a particular volume.value. The ROC curve illustrates the
tradeoffs involved with decision rules. An example of an ROC curve is shown below.
For abalone harvesting, if an infant is harvested according to a specified volume cutoff, that is a
false positive. The infant is being treated as an adult. The true positive rate is the proportion of
adults harvested. Your ROC curve should reveal how the true positive rate and false positive
rate change as the cutoff is changed. Note that with a large volume cutoff, few if any abalone
will be harvested. This would give a true positive rate close to zero percent and the same for the
false positive rate. Conversely, with a very small volume cutoff, nearly all abalones would be
harvested, infants included, so both rates would be close to 100 percent.
a) Find the smallest cutoff (i.e. volume.value) for which no infant is harvested (zero false
positives). Report this cutoff and the corresponding (1-prop.adults) value. Comment on
your findings. Does this seem to be a reasonable choice for a decision rule? Why?
b) How does this cutoff compare to the result determined in #7) above?
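A sketch of the ROC quantities and the zero-false-positive cutoff is shown below, again on a deterministic stand-in data frame demo (hypothetical values). The names tpr, fpr, k.zero and cutoff.zero are hypothetical.

```r
# Deterministic stand-in for mydata; values are illustrative only.
demo <- data.frame(
  SEX    = rep(c("I", "F", "M"), times = c(60, 70, 70)),
  VOLUME = c(seq(10, 200, length.out = 60),
             seq(80, 480, length.out = 70),
             seq(100, 500, length.out = 70))
)
idxi  <- demo$SEX == "I"
idxa  <- demo$SEX %in% c("F", "M")
min.v <- min(demo$VOLUME)
delta <- (max(demo$VOLUME) - min.v) / 100

prop.infants <- numeric(100)
prop.adults  <- numeric(100)
volume.value <- numeric(100)
for (k in 1:100) {
  value <- min.v + k * delta
  volume.value[k] <- value
  prop.infants[k] <- sum(demo$VOLUME[idxi] <= value) / sum(idxi)
  prop.adults[k]  <- sum(demo$VOLUME[idxa] <= value) / sum(idxa)
}

tpr <- 1 - prop.adults    # true positive rate: adults harvested
fpr <- 1 - prop.infants   # false positive rate: infants harvested

# Smallest grid cutoff at which no infant is harvested.
k.zero      <- min(which(fpr == 0))
cutoff.zero <- volume.value[k.zero]
tpr.zero    <- tpr[k.zero]
```

plot(fpr, tpr, type = "l") draws the ROC curve; each point on it corresponds to one entry of volume.value.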
9) An additional calculation will be performed. Harvesting of infants in classes A1 and A2
must be minimized. With these data it is possible to find volume cutoffs for which this infant
harvest is zero. The minimum volume that does this is to be determined.
a) Find the minimum volume which produces a zero harvest of infant abalone in classes A1
and A2. This can be accomplished by substituting values into code comparable to what
is supplied below. A cutoff of 0.036 is used as an example. (Note: this minimum can be
determined to a greater level of precision using other coding methods. How you do this
is up to you.) More precision than shown in the example below is not necessary.
b) Present a table showing the harvest levels by CLASS for the following cutoffs: 0.036,
0.035, 0.034. What proportion of adults is harvested with each of these cutoffs?
> cutoff <- 0.036 # (Note: smaller values will need to be checked.)
> index.A1 <- (mydata$CLASS=="A1")
> indexi <- index.A1 & idxi
> sum(mydata[indexi,11] >= cutoff)/sum(index.A1)
[1] 0
> index.A2 <- (mydata$CLASS=="A2")
> indexi <- index.A2 & idxi
> sum(mydata[indexi,11] >= cutoff)/sum(index.A2)
[1] 0
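The table for part b) could be built by looping over the three cutoffs, as sketched here. The data frame demo is a tiny made-up example (twelve hypothetical abalones in three classes), chosen only so that the code runs; the object names are hypothetical as well.

```r
# Tiny made-up stand-in for mydata; values are illustrative only.
demo <- data.frame(
  CLASS  = rep(c("A1", "A2", "A3"), each = 4),
  SEX    = c("I", "I", "F", "M",  "I", "I", "F", "M",  "I", "F", "M", "M"),
  VOLUME = c(0.010, 0.020, 0.040, 0.050,
             0.030, 0.0345, 0.060, 0.070,
             0.050, 0.080, 0.090, 0.100)
)
cutoffs <- c(0.036, 0.035, 0.034)

# Proportion harvested (VOLUME >= cutoff) within each CLASS, one column per cutoff.
harvest_by_class <- sapply(cutoffs, function(cut) {
  tapply(demo$VOLUME >= cut, demo$CLASS, mean)
})
colnames(harvest_by_class) <- paste0("cutoff_", cutoffs)
harvest_by_class

# Number of A1 and A2 infants harvested at each cutoff (target is zero).
infant_A12 <- demo$SEX == "I" & demo$CLASS %in% c("A1", "A2")
infant_harvest <- sapply(cutoffs, function(cut) {
  sum(demo$VOLUME[infant_A12] >= cut)
})

# Proportion of adults (F and M pooled) harvested at each cutoff.
adults <- demo$SEX %in% c("F", "M")
adult_harvest <- sapply(cutoffs, function(cut) {
  mean(demo$VOLUME[adults] >= cut)
})
```

Scanning a finer sequence of cutoffs with the same infant_harvest calculation is one way to locate the minimum volume more precisely.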
10) Present a table of the cutoffs determined in #7), #8) and #9) with your interpretation. What
tradeoffs and considerations would you present to study investigators?
In your report conclusions, discuss what you have learned about abalone and the use of physical
measurements as a basis for abalone harvesting. How much reliance would you place on the
results you have obtained? What else might be done to verify these conclusions? If you have
specific harvesting recommendations, a strategy or alternative decision rule, discuss them. Are
there other approaches that may be easier to implement in the field? Discuss what you see as
difficulties in analyzing data from an observational study involving different classes or cohorts
of subjects. What cautions come to mind? What can be learned from such studies?