Case Study - SAS Doctoral Level, Framingham Heart Study: Data Preparation

Question # 00864713 Posted By: wildcraft Updated on: 12/10/2024 10:39 PM Due on: 12/11/2024
Subject Education Topic General Education Tutorials:
Question

SAS Data Preparation

Case Study 231 - SAS Doctoral Level

Framingham Heart Study: Data Preparation

Industry Aligned Activity

Purpose

This activity focuses on preparing data from the Framingham Heart Study for future statistical analyses as well as exploring data through descriptive statistics.

SAS Software

This activity can be performed using any SAS programming environment, including SAS Studio in SAS OnDemand for Academics.

Industry Alignment

This activity aligns with the healthcare industry. It uses data from a clinical study conducted to identify characteristics contributing to cardiovascular disease.

Table of Contents

· Framingham Heart Study: Data Preparation
    · Purpose
    · SAS Software
    · Industry Alignment
· Activity Notes and Requirements
    · Learning Objectives
    · Estimated Completion Time
    · Experience Level
    · Prerequisite Knowledge
        · Software
        · Content Knowledge
    · Additional Notes
· Data Source
    · Introduction
    · Description of Variables
· Framingham Heart Study: Data Preparation Activity
    · Part 1: Understanding the Variables
    · Part 2: Creating New Variables and Subsetting the Data
· Appendix
    · Appendix A: Access Software
    · Appendix B: Helpful Documentation
    · Appendix C: Recommended Learning


Activity Notes and Requirements

Learning Objectives

This activity provides practice with skills such as:

· Implementing data changes and manipulations

· Preparing data for future possible statistical analyses

· Exploring data through descriptive statistics including:

· Understanding variables and their values within the data

· Recognizing the need for changes in the data

Estimated Completion Time

This activity will take students approximately 3 hours to complete.

Experience Level

To complete this activity students should have the following levels of experience:

· Intermediate skill in SAS programming

· Beginner skill in statistics

Prerequisite Knowledge

Software

Students should have experience with the following:

· Foundations of programming with the SAS Data Step including using functions and if/then/else conditional statements.

· SAS descriptive procedures such as PROC PRINT, PROC CONTENTS, PROC FREQ, PROC MEANS, and PROC UNIVARIATE.

Content Knowledge

Students should have experience/knowledge with the following concepts:

· Descriptive statistics such as mean, median, counts, and percentages

· Conditional if/then/else logic

Additional Notes

This activity pairs well with the following activities that you will complete:

· Framingham Heart Study: Descriptive Analysis, Industry Applied Activity

· Framingham Heart Study: Statistical Analysis, Industry Applied Activity

Data Source

Introduction

This activity uses the HEART dataset in the SASHELP library. To access the SASHELP library in SAS, select View > Explorer. In the Explorer window, select Libraries > Sashelp. The data came from the landmark Framingham Heart Study (https://framinghamheartstudy.org/). The purpose of the Framingham Heart Study was to identify characteristics contributing to cardiovascular disease. Important links between cardiovascular disease and high blood pressure, high cholesterol levels, cigarette smoking, and many other health factors were first established using its data.

The original cohort of the Framingham Heart Study consisted of 5,209 men and women between the ages of 28 and 62 living in Framingham, Massachusetts. The first visit of data collection for participants in this cohort occurred between 1948 and 1953, and participants were assessed every two years thereafter through April 2014, a span of almost seven decades!

The complete Framingham Heart Study data consists of hundreds of datasets taken over time at 32 biennial exams and has led to over 3000 (wow!) published journal articles. To simplify analyses for illustrative purposes, the SASHELP.HEART dataset includes a snapshot of selected primary study variables taken at one of the biennial exams.

Description of Variables

The variables used for this exercise are:

Variable          Description

Status            Alive or dead
DeathCause        Cause of death
AgeCHDdiag        Age at which CHD was diagnosed
Sex               Male or female
AgeAtStart        Age at the entry into the Framingham Heart Study
Height            Height in inches
Weight            Weight in pounds
Diastolic         Diastolic blood pressure
Systolic          Systolic blood pressure
MRW               Metropolitan Relative Weight
Smoking           Number of packs of cigarettes smoked per week
AgeAtDeath        Age at death
Cholesterol       Total cholesterol
Chol_Status       Total cholesterol categorized into groups
BP_Status         Diastolic and systolic blood pressure categorized into groups
Weight_Status     Height and weight categorized into groups
Smoking_Status    Number of packs of cigarettes smoked per week categorized into groups

 

Framingham Heart Study: Data Preparation Activity

This activity consists of two parts. Part 1 outlines how to explore the data to understand the variables for analysis. Part 2 outlines how to prepare the data for future analyses by creating new variables and subsetting the data.

Part 1: Understanding the Variables

Deciding an appropriate path for analysis often requires many steps. An important first step is exploring and examining the data. An initial exploratory data analysis provides understanding of the meaning of study variables and can provide crucial clues into data preparations needed before analyzing the data.

1. Open and examine the SASHELP.HEART dataset and its variables. Familiarize yourself with the context and meanings behind the variables and their values.

a. How many observations are in the dataset?

b. How many variables are in the dataset? How many are numeric? How many are character?
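
One possible starting point for question 1 is sketched here. It assumes only that the SASHELP library is available in your SAS session; the VARNUM option and the choice to print 10 rows are illustrative, not required by the activity.

```sas
/* List the number of observations and each variable's name, type, and length. */
/* VARNUM orders the variable list by position in the dataset.                 */
proc contents data=sashelp.heart varnum;
run;

/* Print the first 10 observations to get a feel for the values. */
proc print data=sashelp.heart(obs=10);
run;
```

The PROC CONTENTS output reports the observation and variable counts in its attribute table, and the Type column distinguishes numeric from character variables.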

Exploring the assigned values of character variables can demonstrate patterns and inherent orderings. The default ordering of levels in SAS is alphabetical order. The levels of many character variables have an inherent ordering of magnitude. For example, non-smokers smoke less than light smokers who smoke less than moderate smokers.

2. Tabulate the levels of the character variables in the SASHELP.HEART dataset. For each of the character variables:

a. What data values or levels are observed for each?

b. Which variables have an inherent ordering of magnitude? Does alphabetical order of the levels correspond to ordering levels by magnitude for any of these character variables?
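
A minimal sketch for question 2, assuming the character variables are the seven listed in the Description of Variables table (confirm this against your PROC CONTENTS output):

```sas
/* One-way frequency tables for each character variable.             */
/* MISSING makes missing values appear as a level in each table.     */
/* Levels are displayed in the default (alphabetical) order.         */
proc freq data=sashelp.heart;
  tables Status DeathCause Sex Chol_Status BP_Status
         Weight_Status Smoking_Status / missing;
run;
```

Comparing the displayed order of levels with their natural order of magnitude answers part b.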

Examining the values of numeric variables can provide insights into their magnitude, spread, and symmetry. Variables with a symmetric distribution will have roughly equal mean and median, so can be summarized with either statistic. Variables with substantially different mean and median values indicate a non-symmetric distribution. Such variables may be better summarized with a median. Additionally, some numeric variables may have few unique values, so could be better summarized as categorical variables.

3. Generate descriptive statistics and histograms for the numeric variables in the SASHELP.HEART dataset.

a. What is the minimum, maximum, median, and mean of each variable?

b. Do the mean and median seem substantially different for any of the variables?

c. Does Smoking seem to be better suited to be analyzed as a categorical variable or a continuous variable?
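
The statistics and plots requested in question 3 could be produced along these lines. This is a sketch rather than the required solution; the statistic keywords and the MAXDEC= setting are choices made for illustration.

```sas
/* Minimum, maximum, median, and mean for every numeric variable. */
proc means data=sashelp.heart n nmiss min max median mean maxdec=2;
  var _numeric_;
run;

/* Histograms for every numeric variable; NOPRINT suppresses the   */
/* lengthy tables of statistics and keeps only the plots.          */
proc univariate data=sashelp.heart noprint;
  var _numeric_;
  histogram;
run;

/* Count the distinct values of Smoking to judge whether it behaves */
/* more like a categorical variable than a continuous one.          */
proc freq data=sashelp.heart;
  tables Smoking / missing;
run;
```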

The SASHELP.HEART dataset contains several categorical variables whose levels were originally created from values of continuous variables in the dataset. Understanding the relationships between related continuous and categorical predictors in a dataset can inform choices of predictors in later statistical analyses.

4. Explore the variables Weight_Status, Smoking_Status, Chol_Status, and BP_Status as follows:

a. Variables Weight_Status, MRW, and Weight:

i. What are the ranges (minimum and maximum) of variables MRW and Weight for each level of Weight_Status?

ii. Are the ranges of MRW for levels of Weight_Status overlapping?

iii. Are the ranges of Weight for levels of Weight_Status overlapping?

iv. Using your answers to the previous two questions, when this dataset was created which values, MRW or Weight, were used to create the levels for Weight_Status?

b. Variables Smoking_Status and Smoking:

i. Which values of Smoking are categorized as Smoking_Status = Non-smoker, Light, Moderate, Heavy, and Very Heavy?

ii. Are any values of Smoking categorized into more than one level of Smoking_Status?

c. Variables Chol_Status and Cholesterol:

i. What are the ranges (minimum and maximum) of Cholesterol for each level of Chol_Status?

ii. Are the ranges of Cholesterol for levels of Chol_Status overlapping?

d. Variables BP_Status, Diastolic, and Systolic:

i. What are the ranges (minimum and maximum) of Diastolic and Systolic for each level of BP_Status?

ii. Are the ranges of Diastolic for levels of BP_Status overlapping?

iii. Are the ranges of Systolic for levels of BP_Status overlapping?

iv. Normal levels of blood pressure are usually defined as under 120 for systolic blood pressure and under 80 for diastolic blood pressure. Based on your answers to the previous questions, are one or both of systolic and diastolic blood pressure required to be high for the individual to be categorized as BP_Status=High?
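
The comparisons in question 4 can be approached with grouped summaries. The following sketch uses a CLASS statement in PROC MEANS for the ranges and a cross-tabulation in PROC FREQ for Smoking; the option choices are illustrative.

```sas
/* Ranges of MRW and Weight within each level of Weight_Status. */
proc means data=sashelp.heart n min max maxdec=1;
  class Weight_Status;
  var MRW Weight;
run;

/* Which Smoking values fall into which Smoking_Status level.       */
/* A Smoking value that appears in more than one column would be    */
/* categorized into more than one level.                            */
proc freq data=sashelp.heart;
  tables Smoking*Smoking_Status / missing norow nocol nopercent;
run;

/* Range of Cholesterol within each level of Chol_Status. */
proc means data=sashelp.heart n min max maxdec=1;
  class Chol_Status;
  var Cholesterol;
run;

/* Ranges of Diastolic and Systolic within each level of BP_Status. */
proc means data=sashelp.heart n min max maxdec=1;
  class BP_Status;
  var Diastolic Systolic;
run;
```

Non-overlapping ranges across levels suggest that the continuous variable was the one used to define the categories.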

Exploring patterns of missingness in a dataset gives insight into data collection procedures for the study generating the dataset and may also indicate data entry or data collection errors.

5. Examine missing data in the SASHELP.HEART dataset.

a. Which variables have no missing data?

b. Which variables have missing data?

c. For each variable with missing data, what percent of the data is missing?

d. Using what you currently know about the dataset, given the definition of the variable(s) or given values of other variables in the dataset, which variable(s) have patterns of missingness that could be expected?
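
One common way to tabulate missingness for both numeric and character variables at once is to apply formats that collapse every value into "Missing" or "Not missing". The sketch below takes that approach; the format name missfmt is a hypothetical name chosen for this example.

```sas
/* User-defined formats that group values by missingness.          */
/* A numeric and a character format may share a name because the   */
/* character version is distinguished by its leading $.            */
proc format;
  value  missfmt  .  = 'Missing' other = 'Not missing';
  value $missfmt ' ' = 'Missing' other = 'Not missing';
run;

/* With the formats applied, each table has at most two rows, and  */
/* the Percent column gives the percent missing for that variable. */
proc freq data=sashelp.heart;
  format _numeric_ missfmt. _character_ $missfmt.;
  tables _all_ / missing nocum;
run;
```

For the numeric variables alone, the N and NMISS statistics in PROC MEANS give the same counts.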

6. Examine patterns of missingness on certain groups of variables as follows:

a. If MRW is non-missing, are both Height and Weight always non-missing?

b. If Weight_Status is non-missing, are both Height and Weight always non-missing?

c. If Smoking is non-missing, is Smoking_Status always non-missing, and vice versa?

d. If Cholesterol is non-missing, is Chol_Status always non-missing, and vice versa?

e. Analyze DeathCause and AgeAtDeath grouped by Status.

i. Are DeathCause and AgeAtDeath ever missing when Status=Dead?

ii. Are DeathCause and AgeAtDeath ever non-missing when Status=Alive?

f. Analyze AgeCHDdiag grouped by DeathCause. Is AgeCHDdiag ever missing when DeathCause=Coronary Heart Disease?
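
Joint patterns of missingness, as asked for in question 6, can be sketched with the same formatting idea combined with the LIST option of PROC FREQ, which prints one row per observed combination. The format name missfmt is again a hypothetical name defined within this example.

```sas
proc format;
  value  missfmt  .  = 'Missing' other = 'Not missing';
  value $missfmt ' ' = 'Missing' other = 'Not missing';
run;

/* Parts a-d: each row of output is one combination of missingness. */
proc freq data=sashelp.heart;
  format MRW Height Weight Smoking Cholesterol missfmt.
         Weight_Status Smoking_Status Chol_Status $missfmt.;
  tables MRW*Height*Weight
         Weight_Status*Height*Weight
         Smoking*Smoking_Status
         Cholesterol*Chol_Status / list missing;
run;

/* Part e: missingness of DeathCause and AgeAtDeath within Status. */
proc freq data=sashelp.heart;
  format DeathCause $missfmt. AgeAtDeath missfmt.;
  tables Status*DeathCause*AgeAtDeath / list missing;
run;

/* Part f: missingness of AgeCHDdiag within each cause of death. */
proc freq data=sashelp.heart;
  format AgeCHDdiag missfmt.;
  tables DeathCause*AgeCHDdiag / list missing;
run;
```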

Missing values can also impact later statistical analyses. By default, most SAS statistical procedures perform what is called a complete case analysis: they exclude any observation with a missing value for any variable involved in the analysis. Such exclusions can substantially decrease the number of observations in a dataset that are used in a later statistical analysis.

7. Tabulate the percent of observations in the SASHELP.HEART dataset that have non-missing values for all the predictor variables that you will use in later analyses: AgeAtStart, BP_Status, Chol_Status, Cholesterol, Diastolic, Height, MRW, Sex, Smoking, Smoking_Status, Systolic, Weight, and Weight_Status.

Does the SASHELP.HEART dataset seem to have a high amount of missing data for any of these predictors?
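
A minimal sketch for question 7 flags complete cases with the CMISS function, which counts missing arguments of either type. The dataset name WORK.HEART_CC and the variable name Complete are hypothetical names introduced for this example.

```sas
/* Flag observations with no missing values among the predictors. */
data work.heart_cc;
  set sashelp.heart;
  /* CMISS returns the number of missing arguments, numeric or character. */
  /* The comparison evaluates to 1 (complete) or 0 (incomplete).          */
  Complete = (cmiss(AgeAtStart, BP_Status, Chol_Status, Cholesterol,
                    Diastolic, Height, MRW, Sex, Smoking,
                    Smoking_Status, Systolic, Weight, Weight_Status) = 0);
run;

/* The Percent column for Complete=1 is the percent of complete cases. */
proc freq data=work.heart_cc;
  tables Complete;
run;
```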

Part 2: Creating New Variables and Subsetting the Data

An important next step after exploring a dataset is to create any new variables needed for later analyses. The primary outcome of the Framingham Heart Study is whether a patient developed coronary heart disease. Interestingly, this variable is not included in the SASHELP.HEART dataset.

1. Use information in the variable AgeCHDdiag to create a variable describing whether a patient developed coronary heart disease. Specifically, if AgeCHDdiag is non-missing, then the individual had coronary heart disease, and if AgeCHDdiag is missing, the individual did not have coronary heart disease.

a. Create a new numeric variable named CHD.

b. Store this new variable in a temporary dataset named WORK.HEART1.

c. Code this variable so that CHD=1 if AgeCHDdiag takes a value from 0 to 999 and CHD=0 otherwise.
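
One way to code the CHD indicator is sketched here with if/then/else logic. It relies on the fact that a missing numeric value in SAS compares as smaller than any number, so a missing AgeCHDdiag fails the range test and falls through to the ELSE branch.

```sas
/* Create the numeric outcome CHD in a temporary dataset. */
data work.heart1;
  set sashelp.heart;
  /* Missing AgeCHDdiag is less than 0, so it does not satisfy the range. */
  if 0 <= AgeCHDdiag <= 999 then CHD = 1;
  else CHD = 0;
run;
```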

After creating any new variable, make sure to check your work.

2. Generate descriptive statistics for the variable AgeCHDdiag grouped by CHD.

a. Is CHD a numeric variable?

b. When CHD=1, is AgeCHDdiag always non-missing?

c. When CHD=0, is AgeCHDdiag always missing?
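
The check in question 2 might look like the following sketch, which assumes WORK.HEART1 has already been created as described in question 1.

```sas
/* Confirm that CHD is stored as a numeric variable. */
proc contents data=work.heart1;
run;

/* N counts non-missing and NMISS counts missing AgeCHDdiag values  */
/* within each level of CHD.                                        */
proc means data=work.heart1 n nmiss min max;
  class CHD;
  var AgeCHDdiag;
run;
```

If the coding is correct, NMISS should be 0 where CHD=1 and N should be 0 where CHD=0.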

Let’s now turn to creating new predictor variables.

Statistical analyses can determine which variables collected in the Framingham Heart Study are predictive of the development of coronary heart disease. To facilitate comparison of levels of categorical predictors, the levels should be re-coded so that their alphabetical order also corresponds to their order of magnitude. This is desirable because many SAS statistical procedures use the alphabetically last level as the reference level by default. Re-coding is also useful so that levels appear in a logical order in plots.

3. Re-code categorical variables in the SASHELP.HEART dataset as follows:

a. Use WORK.HEART1 as the input dataset.

b. Create an output dataset named WORK.HEART2.

c. Create a new variable Chol_StatusNew by recoding Chol_Status as follows:

High = 1 High

Borderline = 2 Borderline

Desirable = 3 Desirable

d. Create a new variable Sex_New by recoding Sex as follows:

Male = 1 Male

Female = 2 Female

e. Create a new variable Weight_StatusNew by recoding Weight_Status as follows:

Overweight = 1 Overweight

Normal = 2 Normal

Underweight = 3 Underweight

f. Create a new variable Smoking_StatusNew by recoding Smoking_Status as follows:

Very Heavy (> 25) = 1 Very Heavy

Heavy (16-25) = 2 Heavy

Moderate (6-15) = 3 Moderate

Light (1-5) = 4 Light

Non-smoker = 5 Non-smoker

g. Tabulate each of your new variables as follows to check your work:

i. Tabulate levels of each of the four new variables over all observations.

ii. Tabulate levels of Chol_StatusNew grouped by Chol_Status.

iii. Tabulate levels of Sex_New grouped by Sex.

iv. Tabulate levels of Weight_StatusNew grouped by Weight_Status.

v. Tabulate levels of Smoking_StatusNew grouped by Smoking_Status.

vi. Do you see the expected ordering of levels within each variable (in part i) as well as the expected combinations of levels of re-coded and original variables (in parts ii-v)?
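
The re-coding in question 3 could be sketched as follows. It assumes WORK.HEART1 already exists and that the original levels are spelled as listed above. The =: operator compares only the leading characters of Smoking_Status, which is a choice made here so that the parenthesized ranges do not have to be typed exactly.

```sas
data work.heart2;
  set work.heart1;
  /* Declare lengths first so the longest new value is not truncated. */
  length Chol_StatusNew $12 Sex_New $8
         Weight_StatusNew $13 Smoking_StatusNew $12;

  if      Chol_Status = 'High'       then Chol_StatusNew = '1 High';
  else if Chol_Status = 'Borderline' then Chol_StatusNew = '2 Borderline';
  else if Chol_Status = 'Desirable'  then Chol_StatusNew = '3 Desirable';

  if      Sex = 'Male'   then Sex_New = '1 Male';
  else if Sex = 'Female' then Sex_New = '2 Female';

  if      Weight_Status = 'Overweight'  then Weight_StatusNew = '1 Overweight';
  else if Weight_Status = 'Normal'      then Weight_StatusNew = '2 Normal';
  else if Weight_Status = 'Underweight' then Weight_StatusNew = '3 Underweight';

  /* Prefix comparison: 'Very Heavy' is tested before 'Heavy' for clarity, */
  /* although the two prefixes cannot match the same value.                */
  if      Smoking_Status =: 'Very Heavy' then Smoking_StatusNew = '1 Very Heavy';
  else if Smoking_Status =: 'Heavy'      then Smoking_StatusNew = '2 Heavy';
  else if Smoking_Status =: 'Moderate'   then Smoking_StatusNew = '3 Moderate';
  else if Smoking_Status =: 'Light'      then Smoking_StatusNew = '4 Light';
  else if Smoking_Status =: 'Non-smoker' then Smoking_StatusNew = '5 Non-smoker';
run;

/* Check the new variables alone and against their originals. */
proc freq data=work.heart2;
  tables Chol_StatusNew Sex_New Weight_StatusNew Smoking_StatusNew / missing;
  tables Chol_Status*Chol_StatusNew
         Sex*Sex_New
         Weight_Status*Weight_StatusNew
         Smoking_Status*Smoking_StatusNew / missing norow nocol nopercent;
run;
```

Observations with a missing original value are left with a missing re-coded value, because no branch assigns one.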

We have now finished creating new variables.

In part 1, question 7, you tabulated the amount of missing data for the set of predictor variables of interest in the SASHELP.HEART dataset. From this, you noticed that only a small percentage (<5%) of observations in the SASHELP.HEART dataset have missing data for any of these variables.

Ideally, statistical analyses for the SASHELP.HEART dataset should be performed only on observations with no missing data for all these predictors. This ensures that all analyses, regardless of the predictors included, use the same number of observations. Given that the amount of missing data is small, analyses can simply exclude any observation with missing data on at least one of the predictors of interest. Other strategies such as single or multiple imputation could be employed, but those are beyond the scope of this exercise.

4. Create a new permanent dataset that can be used for later statistical analyses.

a. Use WORK.HEART2 as the input dataset.

b. Create a library named HEARTLIB.

c. Create an output dataset named HEARTLIB.MYHEART that contains only those observations that have non-missing values for the variables below:

AgeAtStart

Height

Systolic

BP_Status

MRW

Weight

Chol_StatusNew

Sex_New

Weight_Status

Cholesterol

Smoking

Weight_StatusNew

Diastolic

Smoking_StatusNew

 

 

 

 

 

This dataset should have 5039 observations.

d. Check your work for the dataset HEARTLIB.MYHEART by tabulating values of character variables and generating descriptive statistics for numeric variables. Do you see any missing values in any of the tabulations or statistics generated?
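
A sketch of question 4 follows. The folder path in the LIBNAME statement is a placeholder that must be replaced with a location you can write to, and WORK.HEART2 is assumed to exist from question 3. The variable written as Weight_Status2 in the list is not created anywhere in the activity, so this sketch assumes it refers to Weight_StatusNew.

```sas
/* Placeholder path: substitute a folder in your own environment. */
libname heartlib "/path/to/your/folder";

data heartlib.myheart;
  set work.heart2;
  /* Subsetting IF: keep a row only when none of these is missing. */
  /* Weight_StatusNew stands in for Weight_Status2 (assumption).   */
  if cmiss(AgeAtStart, Height, Systolic, BP_Status, MRW, Weight,
           Chol_StatusNew, Sex_New, Weight_Status, Cholesterol,
           Smoking, Weight_StatusNew, Diastolic, Smoking_StatusNew) = 0;
run;

/* Check character predictors: no missing level should appear. */
proc freq data=heartlib.myheart;
  tables BP_Status Chol_StatusNew Sex_New Weight_Status
         Weight_StatusNew Smoking_StatusNew / missing;
run;

/* Check numeric predictors: NMISS should be 0 for each. */
proc means data=heartlib.myheart n nmiss min max;
  var AgeAtStart Height Weight Diastolic Systolic MRW Smoking Cholesterol;
run;
```

The SAS log reports the number of observations written to HEARTLIB.MYHEART, which can be compared with the count stated in part c.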

Congratulations! You have completed data preparation for the Framingham Heart Study dataset. A next step in exploring relationships between coronary heart disease and predictors of interest is to perform additional descriptive analyses by creating logit plots.

The related Framingham Heart Study: Descriptive Analysis, Industry Applied Activity provides practice in generating logit plots. Following this, logistic regression models can be fit to formalize the statistical relationships between coronary heart disease and predictors of interest. The related Framingham Heart Study: Statistical Analysis, Industry Applied Activity provides practice in fitting these logistic regression models. These activities can be found in the Academic Hub.

Appendix

Appendix A: Access Software

SAS OnDemand for Academics (ODA) is a free, full suite of cloud-based software that supports the analytics life cycle, from data to discovery to deployment. Students can use SAS OnDemand for Academics to get access to SAS Studio for free. Click here to access ODA.

Note: You need to have an established SAS profile linked to an academic affiliation. If you don't have a SAS Profile, click here to set one up.

 

Check out Frequently Asked Questions for more support.

Appendix B: Helpful Documentation

Below are helpful links to documentation regarding the procedures used in the activity.

· The CONTENTS procedure

· The PRINT procedure

· The MEANS procedure

· The FREQ procedure

· The UNIVARIATE procedure

· Base SAS Procedures Guide

· DATA Step Statements: Reference

Appendix C: Recommended Learning

The SAS Global Academic Program offers free e-learning courses for students to learn SAS through the Student Skill Builder. The following e-learning courses and paths available are recommended to help with this activity:

· SAS Programming 1: Essentials

· SAS Programming 2: Data Manipulation Techniques

· Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
