MADA Data Analysis Project

Jayne Musso contributed to this exercise

Author

Patrick Kaggwa

Published

March 29, 2024

1 Summary/Abstract

Write a summary of your project.

2 Introduction

2.1 General Background Information

Provide enough background on your topic that others can understand the why and how of your analysis

2.2 Description of data and data source

Describe what the data is, what it contains, where it is from, etc. Eventually this might be part of a methods section.

2.3 Questions/Hypotheses to be addressed

3 Methods

Describe your methods. That should describe the data, the cleaning processes, and the analysis approaches. You might want to provide a shorter description here and all the details in the supplement.

3.1 Data acquisition

3.2 Data import and cleaning

3.3 Statistical analysis

Explain anything related to your statistical analyses.

4 Results

4.1 Exploratory/Descriptive analysis

Table 1: Data summary table.

skim_type	skim_variable	complete_rate	character.min	character.max	character.empty	character.n_unique	character.whitespace	factor.ordered	factor.n_unique	factor.top_counts	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
character	Educ	1	7	13	0	4	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
factor	Gender	1	NA	NA	NA	NA	NA	FALSE	3	M: 4, F: 3, O: 2	NA	NA	NA	NA	NA	NA	NA	NA
numeric	Height	1	NA	NA	NA	NA	NA	NA	NA	NA	165.66667	15.97655	133	156	166	178	183	▂▁▃▃▇
numeric	Weight	1	NA	NA	NA	NA	NA	NA	NA	NA	70.11111	21.24526	45	55	70	80	110	▇▂▃▂▂
numeric	age	1	NA	NA	NA	NA	NA	NA	NA	NA	41.66667	13.39776	22	34	45	54	56	▃▂▂▂▇
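
The column names in Table 1 (skim_type, skim_variable, and so on) suggest the summary was produced with `skimr::skim()`, although the text does not say so. As a rough illustration of what the numeric columns contain, the sketch below computes the same kind of statistics in base R. The data frame `toy` and the helper `summarise_numeric()` are hypothetical, the values are made up, and base R is swapped in for skimr so the example has no package dependencies.

```r
# Toy data frame with made-up values; it only mimics the column names in Table 1.
toy <- data.frame(
  Height = c(150, 160, 170, 180),
  Weight = c(50, 60, 70, 80),
  age    = c(25, 35, 45, 55)
)

# Hypothetical helper: summarise one numeric column with the kind of
# statistics reported in Table 1 (completeness, mean, sd, quantiles).
summarise_numeric <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
  c(
    complete_rate = mean(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    p0 = q[1], p25 = q[2], p50 = q[3], p75 = q[4], p100 = q[5]
  )
}

# One row per variable, one column per statistic.
summary_table <- t(sapply(toy, summarise_numeric))
print(round(summary_table, 2))
```

Applied to the real `mydata`, the same helper should give values comparable to the Height, Weight, and age rows above, though quantile conventions may differ slightly from those used by skimr.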

4.2 Basic statistical analysis

Figure 1: Scatterplot of weight and age, produced by one of the R scripts.

The scatterplot shows a slight positive association between weight and age: weight tends to increase with increasing age.

Figure 2: Boxplot of height by education level.

The figure shows a box plot of individuals' heights stratified by education level.

4.3 Full analysis

Table 2: Summary of the linear model fit.

The education-level terms (Graduate, High school, and Undergraduate) do not appear to be statistically significant, as their p-values are all above 0.05. Age is significant (p ≈ 0.014): height increases by approximately 1.08 cm for each one-year increase in age, holding education level constant.

term	estimate	std.error	statistic	p.value
(Intercept)	116.7253978	14.9101494	7.8285867	0.0014375
age	1.0822039	0.2607536	4.1502937	0.0142569
EducGraduate	0.4293852	8.7286474	0.0491926	0.9631241
EducHigh school	10.2923787	8.5648274	1.2017030	0.2957598
EducUndergraduate	2.4796700	11.7223156	0.2115341	0.8428112
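
To make the interpretation of Table 2 concrete, the following sketch plugs the reported coefficients into the model equation and recomputes the test statistic for age. It rests on two assumptions that the text does not state: the reference education level is "College" (inferred from the levels named in the plotting code), and the residual degrees of freedom are 9 − 5 = 4 (nine observations according to the Gender counts in Table 1, five estimated coefficients). The 40-year-old individual is purely hypothetical.

```r
# Coefficients copied from Table 2.
coefs <- c(
  intercept     = 116.7253978,
  age           = 1.0822039,
  graduate      = 0.4293852,
  high_school   = 10.2923787,
  undergraduate = 2.4796700
)
se_age <- 0.2607536  # standard error of the age coefficient (Table 2)

# Assumption: 9 observations minus 5 estimated coefficients.
df_resid <- 9 - 5

# Predicted height for a hypothetical 40-year-old in the reference
# education level (assumed to be "College").
pred_ref <- coefs[["intercept"]] + coefs[["age"]] * 40

# Same age, but in the "High school" group: add that group's offset.
pred_hs <- pred_ref + coefs[["high_school"]]

# Recompute the t statistic and two-sided p-value for age.
t_age <- coefs[["age"]] / se_age
p_age <- 2 * pt(-abs(t_age), df = df_resid)

# 95% confidence interval for the age coefficient.
ci_age <- coefs[["age"]] + c(-1, 1) * qt(0.975, df = df_resid) * se_age

print(c(pred_ref = pred_ref, pred_hs = pred_hs))
print(c(t_age = t_age, p_age = p_age))
print(ci_age)
```

If the assumed degrees of freedom are right, the recomputed statistic and p-value should agree with the age row of Table 2, which is a quick consistency check on the reported fit.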

4.4 Code used for the analysis

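##############################################
### Packages used by the code below
library(here)     # here() builds paths relative to the project root
library(ggplot2)  # plotting
library(dplyr)    # provides the %>% pipe
# mydata is the data frame loaded earlier in the analysis script

### Box Plot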
b1<-mydata %>%
  ggplot(mapping = aes(x = `Educ`, y = Height, fill = `Educ`)) +
  geom_boxplot() +
  scale_fill_manual(values = c("College" = "#1f78b4", "High school" = "#33a02c", "Graduate" = "#e31a1c", "Undergraduate" = "#ff7f00")) +
  theme_minimal() +
  labs(x = "Education levels", y = "Height") +
  ggtitle("Boxplot of Education Levels by Height") +
  theme(plot.title = element_text(hjust = 0.5))  # Adjust title alignment
b1
figure_file = here("starter-analysis-exercise","results","figures","education-Height-stratified.png")
ggsave(filename = figure_file, plot=b1)
##############################################
### Scatter plot
s1 <- ggplot(mydata, aes(x = Weight, y = age)) +
  geom_point() +
  stat_smooth(method = "glm", formula = y ~ x) +
  ggtitle("Scatterplot of Weight vs Age") +
  labs(x = "Weight", y = "Age")
s1
figure_file = here("starter-analysis-exercise","results","figures","Weight-Age-stratified.png")
ggsave(filename = figure_file, plot=s1)

############################
#### Third model fit
# fit linear model using height as outcome, age and educational level as predictors

lmfit3 <- lm(Height ~ age + Educ, data = mydata)

# place results from fit into a data frame with the tidy function
lmtable3 <- broom::tidy(lmfit3)

#look at fit results
print(lmtable3)

# save fit results table  
table_file3 = here("starter-analysis-exercise","results", "tables-files", "resulttable3.rds")
saveRDS(lmtable3, file = table_file3)
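
A table saved with `saveRDS()` can be read back with `readRDS()` when the manuscript is rendered. The sketch below shows that round trip under stated assumptions: `lmtable_demo` is a hypothetical stand-in holding two rows copied from Table 2, and a temporary file replaces the project path built with `here()`.

```r
# Hypothetical stand-in for lmtable3: two rows copied from Table 2.
lmtable_demo <- data.frame(
  term     = c("(Intercept)", "age"),
  estimate = c(116.7253978, 1.0822039),
  p.value  = c(0.0014375, 0.0142569)
)

# Assumption: a temporary file is used instead of the real results folder.
demo_file <- tempfile(fileext = ".rds")
saveRDS(lmtable_demo, file = demo_file)

# Read the table back, as a manuscript chunk might, and round for display.
restored <- readRDS(demo_file)
restored$estimate <- round(restored$estimate, 2)
restored$p.value  <- signif(restored$p.value, 2)
print(restored)
```

Rounding at display time rather than before saving keeps the stored table at full precision for any later calculations.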

5 Discussion

5.1 Summary and Interpretation

Summarize what you did, what you found and what it means.

5.2 Strengths and Limitations

Discuss what you perceive as strengths and limitations of your analysis.

5.3 Conclusions

What are the main take-home messages?

Include citations in your Rmd file using BibTeX; the list of references will automatically be placed at the end.

This paper (Leek & Peng, 2015) discusses types of analyses.

These papers (McKay, Ebell, Billings, et al., 2020; McKay, Ebell, Dale, Shen, & Handel, 2020) are good examples of papers published using a fully reproducible setup similar to the one shown in this template.

6 References

Leek, J. T., & Peng, R. D. (2015). Statistics. What is the question? Science (New York, N.Y.), 347(6228), 1314–1315. https://doi.org/10.1126/science.aaa6146

McKay, B., Ebell, M., Billings, W. Z., Dale, A. P., Shen, Y., & Handel, A. (2020). Associations Between Relative Viral Load at Diagnosis and Influenza A Symptoms and Recovery. Open Forum Infectious Diseases, 7(11), ofaa494. https://doi.org/10.1093/ofid/ofaa494

McKay, B., Ebell, M., Dale, A. P., Shen, Y., & Handel, A. (2020). Virulence-mediated infectiousness and activity trade-offs and their impact on transmission potential of influenza patients. Proceedings. Biological Sciences, 287(1927), 20200496. https://doi.org/10.1098/rspb.2020.0496