Data Exercise

Tuberculosis (TB) remains a major global health concern, with millions of new cases reported annually. This exercise reviews and analyzes key factors associated with TB, ranging from demographic and lifestyle factors to health-related variables. The aim of analysis provides understanding of factors associated to TB.The data used in this exercise is synthetic data and was created with the help of a generative Pre-trained Transformer 3.5 (ChatGPT3.5)

# Install and load necessary packages
#install.packages("tidyverse")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Set a seed for reproducibility
set.seed(123)

The set seed was used above for reproducibility reasons and randomly created n =998 observations

Data Processing

# Populate the dataframe with synthetic data
n <- 998  # Number of observations

# creating and populating Variables
tb_data <- data.frame(
  age = sample(18:65, n, replace = TRUE),  # Sample ages from 18 to 65
  gender = sample(c("Male", "Female"), n, replace = TRUE),  # Randomly assign gender
  smoking_status = sample(c("Smoker", "Non-Smoker"), n, replace = TRUE),  # Randomly assign smoking status
  diabetes_status = sample(c("Diabetic", "Non-Diabetic"), n, replace = TRUE),  # Randomly assign diabetes status
  urban_rural_status = sample(c("Urban", "Rural"), n, replace = TRUE)  # Randomly assign urban/rural status
)

# Create associations between variables and TB status
tb_data$tb_status_prob <- plogis(0.05 * (tb_data$age - 65) +
                                   ifelse(tb_data$smoking_status == "Smoker", 1, 0) +
                                   ifelse(tb_data$diabetes_status == "Diabetic", 1, 0) +
                                   ifelse(tb_data$urban_rural_status == "Urban", 0.5, 0))  # Calculate the probability of having TB

# Generate TB status based on probability
tb_data$tb_status <- ifelse(runif(n) < tb_data$tb_status_prob, 1, 0)  # Assign 1 for TB case, 0 for not TB case

# Summary statistics
summary(tb_data)
      age           gender          smoking_status     diabetes_status   
 Min.   :18.00   Length:998         Length:998         Length:998        
 1st Qu.:29.00   Class :character   Class :character   Class :character  
 Median :42.00   Mode  :character   Mode  :character   Mode  :character  
 Mean   :41.43                                                           
 3rd Qu.:53.00                                                           
 Max.   :65.00                                                           
 urban_rural_status tb_status_prob      tb_status    
 Length:998         Min.   :0.08707   Min.   :0.000  
 Class :character   1st Qu.:0.33181   1st Qu.:0.000  
 Mode  :character   Median :0.51250   Median :0.000  
                    Mean   :0.50969   Mean   :0.498  
                    3rd Qu.:0.67918   3rd Qu.:1.000  
                    Max.   :0.92414   Max.   :1.000  

Data Visualizations

Now I will carry some visualizations for some the variables

# Boxplot of Age by TB Status
ggplot(tb_data, aes(x = factor(tb_status), y = age, fill = factor(tb_status))) +
  geom_boxplot() +
  labs(title = "Boxplot of Age by TB Status",
       x = "TB Status",
       y = "Age")

The 0 = No TB and 1 = TB

# Boxplot of Age by Smoking Status
ggplot(tb_data, aes(x = factor(tb_status), y = age, fill = tb_status)) +
  geom_boxplot() +
  labs(title = "Boxplot of Age by tb_status",
       x = "tb_status",
       y = "Age")

# Histogram of Age Stratified by TB Status
ggplot(tb_data, aes(x = age, fill = factor(tb_status))) +
  geom_histogram(binwidth = 5, alpha = 0.7, position = "identity") +
  labs(title = "Histogram of Age Stratified by TB Status",
       x = "Age",
       fill = "TB Status")

The 0 = No TB and 1 = TB

Now I will compare the distribution of age across TB Status.

# Violin Plot of Age by TB Status
ggplot(tb_data, aes(x = factor(tb_status), y = age, fill = factor(tb_status))) +
  geom_violin() +
  labs(title = "Violin Plot of Age by TB Status",
       x = "TB Status",
       y = "Age",
       fill = "TB Status")

# Bar Plot of Smoking Status Stratified by TB Status
ggplot(tb_data, aes(x = smoking_status, fill = factor(tb_status))) +
  geom_bar(position = "dodge") +
  labs(title = "Bar Plot of Smoking Status Stratified by TB Status",
       x = "Smoking Status",
       fill = "TB Status")

The bar graph shows that TB is more prevalent in smokers compared to the non smokers

# Creating a table stratified by TB status

# List to store cross-tabulation tables for each variable
cross_tab_list <- list()

# Cross-tabulation table for tb_status and gender
cross_tab_gender <- table(tb_data$tb_status, tb_data$gender)
cross_tab_list[["Gender"]] <- cross_tab_gender

# Cross-tabulation table for tb_status and smoking_status
cross_tab_smoking <- table(tb_data$tb_status, tb_data$smoking_status)
cross_tab_list[["Smoking Status"]] <- cross_tab_smoking

# Cross-tabulation table for tb_status and diabetes_status
cross_tab_diabetes <- table(tb_data$tb_status, tb_data$diabetes_status)
cross_tab_list[["Diabetes Status"]] <- cross_tab_diabetes

# Cross-tabulation table for tb_status and urban_rural_status
cross_tab_urban_rural <- table(tb_data$tb_status, tb_data$urban_rural_status)
cross_tab_list[["Urban/Rural Status"]] <- cross_tab_urban_rural

# Combine all cross-tabulation tables into one table using cbind
combined_cross_tab <- do.call(cbind, cross_tab_list)

# Print the combined table
cat("Combined Cross-tabulation Table for TB Status and Other Variables:\n")
Combined Cross-tabulation Table for TB Status and Other Variables:
print(combined_cross_tab)
  Female Male Non-Smoker Smoker Diabetic Non-Diabetic Rural Urban
0    228  273        288    213      202          299   291   210
1    254  243        187    310      274          223   251   246

The above table shows the distributions of TB status across different factors

Model Fitting

Now I will fit the different variables to look at the association using step-wise method

# Fitting a logistic regression models 
model1 <- glm(tb_status ~ age, data = tb_data, family = "binomial")
summary(model1)

Call:
glm(formula = tb_status ~ age, family = "binomial", data = tb_data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.068982   0.223218  -9.269   <2e-16 ***
age          0.049734   0.005141   9.675   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.5  on 997  degrees of freedom
Residual deviance: 1280.4  on 996  degrees of freedom
AIC: 1284.4

Number of Fisher Scoring iterations: 4

Model1 suggests that age is a statistically significant associated with Tuberculosis with P-value < 0.05

model2 <- glm(tb_status ~ age + gender,
                      data = tb_data, family = "binomial")
summary(model2)

Call:
glm(formula = tb_status ~ age + gender, family = "binomial", 
    data = tb_data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.954275   0.233296  -8.377   <2e-16 ***
age          0.049675   0.005145   9.654   <2e-16 ***
genderMale  -0.216892   0.133752  -1.622    0.105    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.5  on 997  degrees of freedom
Residual deviance: 1277.7  on 995  degrees of freedom
AIC: 1283.7

Number of Fisher Scoring iterations: 4

Model2 suggests that gender variable may not be a statistically significant predictor of TB status in this model with p-value > 0.05

model3 <- glm(tb_status ~ age + gender + smoking_status,
                      data = tb_data, family = "binomial")
summary(model3)

Call:
glm(formula = tb_status ~ age + gender + smoking_status, family = "binomial", 
    data = tb_data)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -2.472177   0.256980  -9.620  < 2e-16 ***
age                   0.051108   0.005273   9.693  < 2e-16 ***
genderMale           -0.214574   0.136601  -1.571    0.116    
smoking_statusSmoker  0.868348   0.137558   6.313 2.74e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.5  on 997  degrees of freedom
Residual deviance: 1236.7  on 994  degrees of freedom
AIC: 1244.7

Number of Fisher Scoring iterations: 4

Age and smoking status are statistically significant associated with Tuberculosis status in model3. However, the gender variable does not appear again to be statistically significant, as the p-value is above the conventional significance level.

model4 <- glm(tb_status ~ age + gender + smoking_status + diabetes_status ,
                      data = tb_data, family = "binomial")
summary(model4)

Call:
glm(formula = tb_status ~ age + gender + smoking_status + diabetes_status, 
    family = "binomial", data = tb_data)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 -2.216070   0.262147  -8.454  < 2e-16 ***
age                          0.054673   0.005436  10.058  < 2e-16 ***
genderMale                  -0.209714   0.138860  -1.510    0.131    
smoking_statusSmoker         0.877283   0.139874   6.272 3.57e-10 ***
diabetes_statusNon-Diabetic -0.779280   0.141301  -5.515 3.49e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.5  on 997  degrees of freedom
Residual deviance: 1205.4  on 993  degrees of freedom
AIC: 1215.4

Number of Fisher Scoring iterations: 4

Age, smoking status, and diabetes status are statistically significant associated with TB in this model4. However, the gender again does not appear to be statistically significant, as the p-value is above the conventional significance level.

model5 <- glm(tb_status ~ age + gender + smoking_status + diabetes_status + urban_rural_status,
                      data = tb_data, family = "binomial")
summary(model5)

Call:
glm(formula = tb_status ~ age + gender + smoking_status + diabetes_status + 
    urban_rural_status, family = "binomial", data = tb_data)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 -2.384925   0.273729  -8.713  < 2e-16 ***
age                          0.055037   0.005444  10.110  < 2e-16 ***
genderMale                  -0.193101   0.139444  -1.385   0.1661    
smoking_statusSmoker         0.869065   0.140283   6.195 5.82e-10 ***
diabetes_statusNon-Diabetic -0.780349   0.141733  -5.506 3.68e-08 ***
urban_rural_statusUrban      0.329746   0.140109   2.353   0.0186 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.5  on 997  degrees of freedom
Residual deviance: 1199.9  on 992  degrees of freedom
AIC: 1211.9

Number of Fisher Scoring iterations: 4

Age, smoking status, diabetes status, and urban/rural status are statistically significant associated with TB in model5. However,again the gender variable does not appear to be statistically significant, as the p-value is above the conventional significance level.

Conclusion

Using Akaike Information Criterion for model selection. Model 5 has the lowest AIC (1231.2), suggesting it provides a better trade-off between goodness of fit and simplicity. The inclusion of additional variables (gender, urban/rural status) doesn’t necessarily improve model fit, as indicated by AIC and residual deviance.