Week 5 - Public Health Factors have the Greatest Impact on Life Expectancy

Question # 00862885 Posted By: wildcraft Updated on: 11/06/2024 12:03 AM Due on: 11/06/2024

By the end of the week, you should be able to:

1. Perform exploratory data analysis using suitable visualization tools.

2. Learn data preparation to build various ML algorithms.

3. Understand data strategy for addressing different business problems.

4. Develop classification algorithms such as logistic regression, decision tree learning, and random forest to improve sales conversion.

5. Understand the importance of clustering and build clusters using techniques such as K-means clustering and hierarchical clustering.

6. Identify cluster characteristics and corresponding business insights.

7. Demonstrate the application of recommender systems in cross-selling to customers.

Which Public Health Factors have the Greatest Impact on Life Expectancy?

Life expectancy is a crucial metric for evaluating population health. It gives the average number of years that people in a population are expected to live, and it is estimated from various public health factors. The task of this project is to determine which factors help in determining life expectancy.

Data Source:

The raw data was extracted from the Global Health Observatory (GHO) data repository of the World Health Organization (WHO), which keeps track of health status. The features of the dataset include:

Country	HIV/AIDS	Measles
Year	Hepatitis B	Body Mass Index (BMI)
Life expectancy	Polio	Status
Adult mortality	Diphtheria	Prevalence for malnutrition 5-9
Infant mortality	Gross Domestic Product (GDP)	Education
Alcohol consumption	Population	Total expenditure on health
Expenditure on health (%)	Prevalence for malnutrition 1-19

Task 1:

Read the raw data from the source file in Python.

Perform feature engineering:

1. Population Size – Create a population range that includes three categories:

   1. Small – a population between 1,000 and 29,999,

   2. Medium – a population between 30,000 and 99,999, and

   3. Large – a population of 100,000 or more.

2. Lifestyle – Create a lifestyle feature that combines alcohol consumption and BMI.

3. Economy – Create an economy feature that combines population and GDP.

4. Death Ratio – Determine the death ratio between adult and infant mortality.
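
The four engineered features above can be sketched as follows. This is only one possible reading: the assignment does not say how Lifestyle and Economy should combine their inputs, so the products used here are assumptions, as are the column names (`Population`, `Alcohol`, `BMI`, `GDP`, `Adult Mortality`, `Infant Deaths`) and the direction of the death ratio (adult over infant). The toy rows are invented for illustration.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the four engineered columns added."""
    out = df.copy()
    # Population size: Small 1,000-29,999, Medium 30,000-99,999, Large >= 100,000.
    # right=False makes each bin include its lower edge and exclude its upper edge.
    # Populations below 1,000 fall outside every bin and become NaN.
    out["Population Size"] = pd.cut(
        out["Population"],
        bins=[1_000, 30_000, 100_000, np.inf],
        labels=["Small", "Medium", "Large"],
        right=False,
    )
    # Lifestyle: assumed here to be the product of alcohol consumption and BMI.
    out["Lifestyle"] = out["Alcohol"] * out["BMI"]
    # Economy: assumed here to be population times GDP.
    out["Economy"] = out["Population"] * out["GDP"]
    # Death ratio: adult mortality over infant mortality; a zero denominator gives NaN.
    out["Death Ratio"] = out["Adult Mortality"] / out["Infant Deaths"].replace(0, np.nan)
    return out

# Toy rows for illustration only, not taken from the WHO data.
toy = pd.DataFrame({
    "Population": [5_000, 30_000, 2_500_000],
    "Alcohol": [1.0, 4.0, 8.0],
    "BMI": [20.0, 25.0, 30.0],
    "GDP": [500.0, 2_000.0, 40_000.0],
    "Adult Mortality": [300.0, 150.0, 60.0],
    "Infant Deaths": [60.0, 0.0, 3.0],
})
features = add_engineered_features(toy)
```

Replacing zero infant mortality with NaN avoids infinite ratios; whether those rows should instead be dropped or capped is a modeling choice the assignment leaves open.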

Task 2:

Perform data cleaning by either removing incomplete observations or imputing missing values, as necessary. Generate a scatter plot of each predictor against the target variable to check for a linear relationship, and apply data transformations such as a log transform if necessary.
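
A minimal sketch of the cleaning and transformation step, assuming pandas. Dropping rows that lack the target, imputing with the median, and using `log1p` are illustrative choices rather than requirements of the task, and the column names and values are hypothetical. The scatter plots themselves are left out here.

```python
import numpy as np
import pandas as pd

def clean_and_transform(df, target, log_columns):
    """Drop rows without a target, impute numeric gaps, and log-transform chosen columns."""
    out = df.dropna(subset=[target]).copy()
    numeric = out.select_dtypes(include="number").columns
    # Median imputation is one simple option; it is robust to extreme values.
    out[numeric] = out[numeric].fillna(out[numeric].median())
    for col in log_columns:
        # log1p(x) = log(1 + x) stays defined at x = 0.
        out[f"log_{col}"] = np.log1p(out[col])
    return out

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Life expectancy": [52.0, np.nan, 71.0, 80.0],
    "GDP": [400.0, 900.0, np.nan, 42_000.0],
})
cleaned = clean_and_transform(toy, target="Life expectancy", log_columns=["GDP"])
```

The transformed column is added alongside the original so that scatter plots of both versions against the target can be compared before deciding which one to keep.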

Task 3:

Generate a correlation heat map to assess multicollinearity, with the threshold set at 0.75. Variables whose correlation exceeds 0.75 need to be dropped.
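
One way to apply the 0.75 threshold is sketched below. It assumes the data frame holds only numeric predictors (the target is excluded beforehand) and it drops the later column of each highly correlated pair, which is one interpretation of the instruction; dropping both columns of a pair would be another. Drawing the heat map itself (for example with seaborn's `heatmap` on the correlation matrix) is omitted.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.75):
    """Drop one column from each pair whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy predictors: "b" is an exact multiple of "a", "c" is only weakly related to either.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})
reduced, dropped = drop_correlated(toy)
```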

Task 4:

Generate box-and-whisker plots to identify possible outliers, and eliminate them.
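
Points drawn beyond the whiskers of a box plot conventionally lie more than 1.5 times the interquartile range (IQR) outside the quartiles. A sketch of removing such rows, assuming that convention and a hypothetical column name:

```python
import pandas as pd

def remove_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # between() is inclusive on both ends; missing values fail the test and are dropped.
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Toy values for illustration only; 100.0 sits far above the rest.
toy = pd.DataFrame({"Alcohol": [10.0, 12.0, 11.0, 13.0, 12.0, 100.0]})
trimmed = remove_iqr_outliers(toy, ["Alcohol"])
```

Whether a flagged value is a data error or a genuine extreme is worth checking before deleting it, since removal changes the fitted model.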

Task 5:

Perform data analysis to answer the following questions:

· Should a country having a lower life expectancy value (<65) increase its healthcare expenditure to improve its average lifespan?

· What is the impact of schooling on the lifespan of humans?

· Does Life Expectancy have a positive or negative relationship with drinking alcohol?

· Do densely populated countries tend to have a lower life expectancy?
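
A simple starting point for these questions is the correlation of each candidate factor with life expectancy, optionally restricted to rows below the 65-year cutoff. The sketch below assumes pandas and hypothetical column names; the toy values are invented, and a correlation on its own does not establish that changing a factor would change lifespan.

```python
import pandas as pd

def correlation_with_target(df, target, predictors, max_target=None):
    """Pearson correlation of each predictor with the target.

    If max_target is given, only rows with target < max_target are used.
    """
    subset = df if max_target is None else df[df[target] < max_target]
    return subset[predictors].corrwith(subset[target])

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Life expectancy": [50.0, 55.0, 60.0, 64.0, 70.0, 75.0, 80.0],
    "Schooling": [4.0, 6.0, 8.0, 9.0, 11.0, 13.0, 15.0],
    "Health expenditure": [2.0, 3.0, 3.5, 5.0, 6.0, 7.0, 9.0],
})
overall = correlation_with_target(toy, "Life expectancy", ["Schooling", "Health expenditure"])
low_only = correlation_with_target(toy, "Life expectancy", ["Health expenditure"], max_target=65)
```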

Task 5 (continued):

Split the remaining data into roughly 75% for the training set and 25% for the test set. Train a linear regression model and assess its performance on the training set, the test set, and the entire dataset.

To assess model performance, use metrics such as Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and the R² score.
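
The split, fit, and scoring steps can be sketched with scikit-learn as below. The data here is synthetic stand-in data generated with NumPy, not the WHO dataset, and the `evaluate` helper is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate(model, X, y):
    """Return RMSE, MAPE (as a fraction, not a percentage), and R^2 for a fitted model."""
    pred = model.predict(X)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y, pred))),
        "mape": float(mean_absolute_percentage_error(y, pred)),
        "r2": float(r2_score(y, pred)),
    }

# Synthetic stand-in data: 200 rows, 3 predictors, target in a life-expectancy-like range.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 70 + X @ np.array([5.0, -3.0, 1.5]) + rng.normal(scale=2.0, size=200)

# 75% training, 25% test; random_state fixes the shuffle so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression().fit(X_train, y_train)

scores = {
    "train": evaluate(model, X_train, y_train),
    "test": evaluate(model, X_test, y_test),
    "all": evaluate(model, X, y),
}
```

A test score much worse than the training score would suggest overfitting; similar scores on all three sets suggest the model generalizes about as well as it fits.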

Draw a residual scatter plot with the target variable on the x-axis and the predicted values on the y-axis. The plot should contain an ideal unity line representing the cases where the predicted values equal the target values, along with dotted error lines at ±5 years (colored yellow) and ±10 years (colored red). These lines make the scatter of the predictions, and hence the model's performance, easier to see.

Draw a residual histogram.
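
The two plots described above can be sketched with matplotlib as follows. The function name `plot_diagnostics` and the synthetic values are assumptions for illustration; residuals are taken here as actual minus predicted.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

def plot_diagnostics(y_true, y_pred):
    """Predicted-vs-actual scatter with unity and error lines, plus a residual histogram."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(11, 5))
    lims = np.array([min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())])

    ax_scatter.scatter(y_true, y_pred, s=12, alpha=0.6)
    # Unity line: points on it are predicted exactly.
    ax_scatter.plot(lims, lims, color="black", label="unity (predicted = actual)")
    # Dotted error lines at +/- 5 years (yellow) and +/- 10 years (red).
    for offset, color in [(5, "yellow"), (10, "red")]:
        ax_scatter.plot(lims, lims + offset, linestyle=":", color=color, label=f"±{offset} years")
        ax_scatter.plot(lims, lims - offset, linestyle=":", color=color)
    ax_scatter.set_xlabel("Actual life expectancy (years)")
    ax_scatter.set_ylabel("Predicted life expectancy (years)")
    ax_scatter.legend()

    ax_hist.hist(y_true - y_pred, bins=30)
    ax_hist.set_xlabel("Residual (actual - predicted, years)")
    ax_hist.set_ylabel("Count")
    return fig

# Synthetic values for illustration only.
rng = np.random.default_rng(2)
actual = rng.uniform(50, 85, size=100)
predicted = actual + rng.normal(scale=3.0, size=100)
fig = plot_diagnostics(actual, predicted)
```

Calling `fig.savefig("diagnostics.png")` writes the figure to disk; the file name is arbitrary.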

Perform appropriate cross-validation to check whether the linear regression model overfits the data. Generate a box plot to display model performance for each fold. Also, determine the mean and standard deviation of the overall performance.
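
A sketch of k-fold cross-validation with scikit-learn, assuming five folds and RMSE as the performance measure; neither is specified by the task. The data is again synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data, not the WHO dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 70 + X @ np.array([5.0, -3.0, 1.5]) + rng.normal(scale=2.0, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
# scikit-learn reports negated RMSE so that larger is always better; flip the sign back.
fold_rmse = -cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_root_mean_squared_error"
)

mean_rmse, std_rmse = fold_rmse.mean(), fold_rmse.std()
```

Passing `fold_rmse` to matplotlib's `boxplot` gives the per-fold box plot. A small standard deviation across folds, with a mean close to the training error, is consistent with a model that is not overfitting.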

Task 6:

Determine the minimum number of features, and which features need to be included, to ensure that all the data points fall within the error lines mentioned above.
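
One possible approach is greedy forward selection: add one feature at a time until the largest absolute error is within the tolerance. The sketch below assumes the outer ±10-year lines are the bound, measures errors on the same data the model was fitted on, and uses synthetic data with hypothetical feature names. Greedy search is a heuristic and may not find the true minimum subset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def smallest_feature_set(X, y, names, tolerance=10.0):
    """Greedy forward selection until every absolute error is within tolerance.

    Returns (selected names, worst absolute error). If the tolerance is never
    reached, all features are returned along with the final worst error.
    """
    selected, remaining = [], list(range(X.shape[1]))
    worst = np.inf
    while remaining and worst > tolerance:
        # Try each remaining feature and keep the one giving the smallest worst-case error.
        trials = []
        for j in remaining:
            cols = selected + [j]
            pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
            trials.append((np.abs(y - pred).max(), j))
        worst, best = min(trials)
        selected.append(best)
        remaining.remove(best)
    return [names[j] for j in selected], float(worst)

# Synthetic data: only f0 and f1 drive the target; f2 and f3 are irrelevant.
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 4))
y = 70 + 8 * X[:, 0] + 6 * X[:, 1] + rng.uniform(-1, 1, size=200)
chosen, worst_error = smallest_feature_set(X, y, ["f0", "f1", "f2", "f3"])
```

Checking the bound on held-out data instead of the fitting data would give a stricter answer, at the cost of a result that depends on the particular split.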
