Abstract

In recent years, people worldwide have increasingly focused on living healthy lifestyles and, as a result, on their body mass index and body fat percentage. Getting an accurate reading of body fat percentage can be expensive and inconvenient, as it is done using underwater weighing. People who do not have access to such facilities can instead rely on a measuring tape and a weight scale.

This paper analyzes a dataset to develop a framework for approximating body fat percentage from weight, height, age, and various body circumference measurements. The dataset contains estimates of body fat percentage obtained by underwater weighing, together with body measurements, from 252 men. The body measurements include the circumferences of the neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, and wrist.

Table of Contents

Introduction

Preview of Data and Data Analysis Method

Analysis and Code

Multiple Linear Regression

Decision Tree

Results

Comparison with Similar Studies

Accuracy Measure for Multiple Linear Regression

Accuracy Measure for Decision Tree

Conclusion and Recommendations

References

Introduction

Body fat percentage is an essential indicator of whether someone is at risk of developing obesity-related diseases, which range from high blood pressure to heart disease. Weight alone is not a strong indicator of the risk of obesity-related diseases such as hypertension, early atherosclerosis, and hyperlipidemia, as noted by Chatterjee, Chatterjee, and Bandyopadhyay (2006). Body fat percentage, however, can help identify such risks regardless of an individual’s weight. Addressing the risk posed by obesity-related diseases therefore requires a way to determine an individual’s body fat percentage. Various factors are related to body fat percentage, including age, weight, height, and measurements such as neck circumference.

This project aims to develop a model that estimates body fat percentage from some or all of these variables. The model is based on data collected from 252 study participants. The data includes 19 attributes, two of which are calculated body fat percentages: percent body fat using the Brozek equation and percent body fat using the Siri equation. For model development, percent body fat using the Brozek equation is preferred, based on the comparison of the two equations by Guerra et al. (2010). The candidate predictor variables are age, density, weight, height, adiposity index, fat-free weight, and the circumferences of the neck, chest, abdomen, hip, thigh, knee, ankle, extended biceps, forearm, and wrist.

Preview of Data and Data Analysis Method

The dataset used in this project includes 252 instances and 19 attributes. All the data is numeric, and none of the attributes has missing values. Data preparation is an essential stage of the data mining process. It includes recovering incomplete data by filling in missing values, purifying the data by correcting errors, and resolving data conflicts (Zhang, Zhang, & Yang, 2003). Analysis of the provided data set did not reveal a need for any of these activities, so the project proceeds directly to data analysis. The models are developed using multiple linear regression and decision trees in the Statistical Package for the Social Sciences (SPSS).

Multiple linear regression was chosen because the data set contains multiple variables that can be used to determine an individual’s body fat percentage. Unlike simple linear regression, multiple linear regression uses several independent variables to predict the outcome of a dependent variable. Previous studies on body fat percentage also make use of multiple regression. An example is the study by Weiler et al. (2000), which uses multiple linear regression after multiple correlation analyses to predict bone mineral content and density using weight, age, height, and fat. For this study, the assumption behind the use of multiple regression is that body fat percentage is directly related to a linear combination of the noted attributes (Tranmer & Elliot, 2008). Like multiple regression, decision trees describe the relationships between the independent attributes and the dependent attribute. These relationships can then be used to predict the dependent attribute. In the study by Uçar et al. (2021), decision trees are among the methods used to develop a body fat percentage prediction model.

Analysis and Code

Multiple Linear Regression

Linear regression models a linear relationship between an input variable and an output variable, so that the output (y) can be calculated from the input (x). When there is more than one input variable, the result is multiple linear regression, where the assumption is that the output variable (y) can be calculated from the multiple input variables. The multiple linear regression equation is:

Y = B₀ + B₁X₁ + B₂X₂ + … + BₚXₚ

Y represents the dependent variable, which is predicted from the multiple independent variables. B₀ represents the constant (intercept), which is the value of Y when all the independent variables are equal to zero. The independent variables are represented by X₁, X₂, …, Xₚ, and the calculated regression coefficients by B₁, B₂, …, Bₚ.
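The coefficients in an equation of this form are commonly estimated by ordinary least squares. The sketch below is only an illustration of that idea in plain Python, not the procedure run in SPSS for this study: the function names and the small data set are made up, and the normal equations are solved with a simple Gauss-Jordan elimination.

```python
def fit_linear_regression(rows, targets):
    """Least-squares fit of Y = B0 + B1*X1 + ... + Bp*Xp.

    rows is a list of [X1, ..., Xp] lists; targets is the list of Y values.
    Returns [B0, B1, ..., Bp].
    """
    # Prepend a constant 1 to every row so that B0 is estimated as well.
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])

    # Build the normal equations (X^T X) B = X^T Y as an augmented matrix.
    aug = []
    for i in range(k):
        row = [sum(r[i] * r[j] for r in design) for j in range(k)]
        row.append(sum(r[i] * y for r, y in zip(design, targets)))
        aug.append(row)

    # Solve by Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


def predict(coefficients, row):
    """Evaluate B0 + B1*X1 + ... + Bp*Xp for one observation."""
    return coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))


# Made-up data generated from Y = 1 + 2*X1 + 3*X2, so the fit should
# recover coefficients close to 1, 2 and 3.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 7]]
Y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
B = fit_linear_regression(X, Y)
print([round(b, 6) for b in B])
```

Because the example data follow the equation exactly, the recovered values of B₀, B₁, and B₂ should match the generating coefficients up to floating-point error; with real measurements the fit minimizes the squared differences instead.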

Before the multiple linear regression model is established, one useful step is to identify the correlation coefficients between the independent variables and the dependent variable. This project aims to identify the relationship between percent body fat and multiple independent variables. The Pearson correlation coefficient indicates how strong the relationship is between each independent variable and the dependent variable. For this project, the dependent variable is the percent body fat calculated using the Brozek model. The correlation coefficients therefore show how strongly the body fat percentage calculated using the Brozek model is related to independent variables such as density, age, weight, height, adiposity index, fat free weight, and the various circumferences. The calculated Pearson correlation coefficients are included below:

Image 1: Pearson Correlation Coefficients

The results indicate a very strong relationship between percent body fat calculated using the Brozek model and the density and abdomen circumference variables. A strong relationship exists between percent body fat and weight, adiposity index, chest circumference, and hip circumference. A moderate relationship exists between percent body fat and neck, thigh, knee, and extended biceps circumference. The relationships between percent body fat and age, ankle circumference, forearm circumference, and wrist circumference can be classified as weak, and those with height and fat free weight as very weak. Most of the relationships between body fat percentage and the independent variables are positive; the exceptions are density and height, which are negatively related to body fat percentage. The strength categories follow the table defined by Liang et al. (2019).

Image 2: Pearson Correlation coefficient strength (Liang et al., 2019).

Analysis of the p-values indicates that the relationships between most of the variables and body fat percentage are unlikely to be due to chance. The exceptions are height and fat free weight, whose relationships with body fat percentage are very weak and could plausibly have arisen by chance. As seen in the image below, the p-values of the height and fat free weight variables are greater than 0.05, so these relationships are not statistically significant. The p-values for the other variables are less than 0.05, so those relationships are statistically significant. Although height and fat free weight are not statistically significant, they are not eliminated from the multiple linear regression model.

Image 3: P-values of Height and Fat Free Weight.
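As an illustration of the correlation step, the following sketch computes a Pearson correlation coefficient and the t statistic commonly used to test whether it differs from zero. It is not the SPSS output: the function names are illustrative and the eight value pairs are hypothetical, not taken from the data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical values for illustration only (not taken from the data set).
abdomen = [85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5]
body_fat = [12.6, 6.9, 24.6, 10.9, 27.8, 20.6, 19.0, 12.8]

r = pearson_r(abdomen, body_fat)
t = t_statistic(r, len(abdomen))
print(f"r = {r:.3f}, t = {t:.2f}")
```

With 252 instances there are 250 degrees of freedom, for which the two-sided 5% critical value of t is roughly 1.97. Under that approximation, a correlation needs an absolute value of about 0.12 or more to reach p < 0.05, which is consistent with the very weak correlations being the ones that fail the significance test.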

After the identification of the correlation coefficients, the next step is to perform multiple linear regression analysis on the data set. The first result of the analysis is the model summary, which identifies how well the multiple linear regression model fits the data set. The model summary includes the R value, R squared, adjusted R squared, and the standard error of the estimate.

Image 4: Model Summary

The R value in this study is 0.994. The R value in the model summary indicates the quality of prediction of the generated model: the higher the R value, the better the prediction of the dependent variable. The R value of 0.994 therefore indicates a high quality of prediction of the body fat percentage. R squared identifies the proportion of the variation in the dependent variable that the independent variables can explain. In this study, the R squared value is 0.987, which shows that the independent variables included in the model explain 98.7% of the variation in the dependent variable. The adjusted R squared value similarly describes the explained variability in the dependent variable, body fat percentage, while accounting for the number of independent variables. The next result of the multiple linear regression is the ANOVA table.

Image 5: ANOVA table

The ANOVA table is used to identify whether the independent variables in the multiple linear regression model predict the dependent variable, body fat percentage, to a statistically significant degree. The result from the ANOVA table is F(16, 235) = 1138.148, p < .0005, which illustrates that the regression model is a good fit for the data set used in the analysis. The next result of the multiple linear regression analysis is the coefficients table, which provides the coefficients used to build the multiple linear regression model.
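The reported F statistic can be cross-checked against the reported R squared using the standard relation between the two for a regression with p predictors and n cases. The sketch below assumes p = 16 and n = 252, which follow from the degrees of freedom F(16, 235); the function names are illustrative.

```python
def f_statistic(r_squared, n, p):
    """Overall regression F statistic from R squared, n cases, p predictors."""
    return (r_squared / p) / ((1 - r_squared) / (n - p - 1))


def r_squared_from_f(f_value, n, p):
    """Invert the relation above to recover R squared from F."""
    df_residual = n - p - 1
    return (f_value * p) / (f_value * p + df_residual)


n_cases, n_predictors = 252, 16  # implied by F(16, 235) in the text

# Residual degrees of freedom: n - p - 1.
print(n_cases - n_predictors - 1)

# F computed from the rounded R squared of 0.987.
print(round(f_statistic(0.987, n_cases, n_predictors), 1))

# R squared implied by the reported F value of 1138.148.
print(round(r_squared_from_f(1138.148, n_cases, n_predictors), 4))
```

The rounded R squared of 0.987 gives an F of roughly 1115 rather than 1138.148. The difference is a rounding effect: the reported F value implies an R squared of about 0.9873, which still rounds to 0.987, so the two reported figures are consistent with each other.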

Image 6: Coefficients Table

From the coefficients table, the multiple linear regression model is:

Predicted body fat percentage = 253.259-(234.097*Density)+(0.006*Age)+(0.159*Weight)+(0.013*Height)-(0.234*Adiposity Index)-(0.230*Fat Free Weight)+(0.020*Neck Circumference)+(0.069*Chest Circumference)+(0.024*Abdomen Circumference)+(0.019*Hip Circumference)+(0.069*Thigh Circumference)+(0.012*Knee Circumference)+(0.003*Ankle Circumference)-(0.003*Extended biceps Circumference)+(0.099*Forearm Circumference)+(0.163*Wrist Circumference)

From the model, the predicted body fat percentage decreases as Density, Adiposity Index, Fat Free Weight, and Extended biceps Circumference increase, and increases as each of the other variables increases, with the remaining variables held constant.
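The fitted equation can be written as a small function for making predictions. In the sketch below the coefficients are copied from the equation above; the dictionary keys and function names are illustrative, and the inputs are assumed to be in the same units as the original data set.

```python
# Coefficients copied from the regression equation in the text.
INTERCEPT = 253.259
COEFFICIENTS = {
    "density": -234.097,
    "age": 0.006,
    "weight": 0.159,
    "height": 0.013,
    "adiposity_index": -0.234,
    "fat_free_weight": -0.230,
    "neck": 0.020,
    "chest": 0.069,
    "abdomen": 0.024,
    "hip": 0.019,
    "thigh": 0.069,
    "knee": 0.012,
    "ankle": 0.003,
    "biceps": -0.003,
    "forearm": 0.099,
    "wrist": 0.163,
}


def predict_body_fat(measurements):
    """Return the predicted Brozek body fat percentage for one person.

    measurements maps every key in COEFFICIENTS to a numeric value.
    """
    return INTERCEPT + sum(
        COEFFICIENTS[name] * measurements[name] for name in COEFFICIENTS
    )


def effect_of_change(variable, delta):
    """Change in the prediction when one variable shifts by delta."""
    return COEFFICIENTS[variable] * delta


# Variables whose increase lowers the predicted body fat percentage.
negative = sorted(name for name, b in COEFFICIENTS.items() if b < 0)
print(negative)

# Effect of a 0.01 increase in density, all else held constant.
print(round(effect_of_change("density", 0.01), 3))
```

Listing the negative coefficients reproduces the four variables named above, and the density example shows why that variable dominates the model: a change of only 0.01 in density shifts the prediction by more than two percentage points.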

Decision Tree

One limitation of linear regression is that it does not fully describe how the relationship between an independent variable and the dependent variable may depend on other variables. The data set used in this study, with its multiple independent variables, illustrates this. It is possible, for example, that body fat percentage has a strong negative correlation with the density variable only when the weight variable is above a certain level. Without explicit interaction terms, multiple linear regression cannot identify such cases, where the relationship between a predictor and the dependent variable depends on other predictors. Therefore, a decision tree is used to identify such relationships. Decision trees split the most relevant inputs into categories. They produce results that are easy to analyze and allow quick predictions from the generated model. To build the decision tree, the body fat percentage variable is used as the dependent variable, and the other variables are used as the independent variables. The CHAID (Chi-Square Automatic Interaction Detector) growing method is used since it allows the data to be split into two or more categories at each node. Additionally, the CHAID growing method is preferred since it accommodates a large sample size with more than 100 instances. The generated decision tree is shown below:

Image 7: Decision Tree Model Summary

Image 8: Decision Tree

Image 7 shows the decision tree model summary. The summary includes the growing method used in the analysis, the number of nodes in the tree, the depth of the tree, and the independent variables included in the generated decision tree. While all the independent variables were specified, only two were included in the decision tree: density and abdomen circumference. The density variable is split into three categories: values less than or equal to 1.06030, values between 1.06030 and 1.07280, and values above 1.07280. Of the instances, 59% are less than or equal to 1.06030, 20.2% are between 1.06030 and 1.07280, and 19.8% are above 1.07280. The depth of the decision tree is two, and the other variable, abdomen circumference, is attached to the density node with values less than or equal to 1.06030. The node with values between 1.06030 and 1.07280 and the node with values above 1.07280 are terminal nodes, which indicate where the decision tree stopped growing. The abdomen circumference variable is split into two nodes: the first contains values less than or equal to 98, and the second contains values above 98. Of the instances, 31.3% belong to the first category and 28.6% to the second. No further variables are included in the decision tree after the abdomen circumference split.

The results of the decision tree agree with the Pearson correlation coefficients, where both the density and abdomen circumference variables had very strong relationships with the body fat percentage variable. The decision tree therefore suggests that density and abdomen circumference are the main attributes that can be used to predict an individual’s body fat percentage.
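The splits described above can be expressed as a short set of rules. The sketch below uses the thresholds reported in the text; the function name and node labels are illustrative, and because the predicted body fat value for each node is not stated in the text, the function returns only the node an observation falls into.

```python
DENSITY_LOW = 1.06030    # first density split reported in the text
DENSITY_HIGH = 1.07280   # second density split reported in the text
ABDOMEN_SPLIT = 98       # abdomen circumference split reported in the text


def terminal_node(density, abdomen):
    """Return a label for the terminal node an observation falls into."""
    if density <= DENSITY_LOW:
        # Only this branch is split further, on abdomen circumference.
        if abdomen <= ABDOMEN_SPLIT:
            return "density <= 1.06030, abdomen <= 98"
        return "density <= 1.06030, abdomen > 98"
    if density <= DENSITY_HIGH:
        return "1.06030 < density <= 1.07280"
    return "density > 1.07280"


# Hypothetical observations for illustration only.
examples = [(1.0500, 101.0), (1.0500, 90.0), (1.0650, 88.0), (1.0800, 80.0)]
for density, abdomen in examples:
    print(density, abdomen, "->", terminal_node(density, abdomen))
```

The rules yield four terminal nodes. Together with the root node and the internal density node that is split on abdomen circumference, this accounts for the six nodes and depth of two reported in the model summary.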

Additional results from the decision tree analysis include the risk estimate table, which is included below:

Image 9: Risk Estimate Table

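The risk estimate is 13.225. Because body fat percentage is a continuous variable, this value is best interpreted as the within-node variance of the predictions rather than as a probability. Its square root, approximately 3.64 percentage points of body fat, indicates that the decision tree’s predictions carry a relatively large error.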
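For a continuous dependent variable, the SPSS documentation describes the risk estimate as the within-node variance. Under that reading, the sketch below shows how such a value could be computed and how the reported figure converts to the units of body fat percentage. The function name and the grouped values are hypothetical and are not taken from the data set.

```python
import math


def within_node_variance(nodes):
    """Risk for a continuous target: mean squared deviation of each value
    from the mean of the terminal node it falls into."""
    total_cases = sum(len(values) for values in nodes)
    squared_error = 0.0
    for values in nodes:
        node_mean = sum(values) / len(values)
        squared_error += sum((v - node_mean) ** 2 for v in values)
    return squared_error / total_cases


# Hypothetical body fat values grouped by terminal node (illustration only).
example_nodes = [[28.0, 30.0, 32.0], [20.0, 24.0], [10.0, 12.0, 14.0]]
print(within_node_variance(example_nodes))

# Reported risk estimate converted to the units of body fat percentage.
print(round(math.sqrt(13.225), 2))
```

Taking the square root puts the risk estimate on the same scale as the dependent variable, which makes it easier to compare with the standard error of the estimate from the regression model.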

Results

Comparison with Similar Studies

In a study that uses a data set similar to the one used here, Uçar, Ucar, Köksal, and Daldal (2021) aim to develop a model that measures body fat percentage to support obesity treatment. Uçar et al. (2021) note that the devices used to measure body fat percentage are expensive and, therefore, not recommended for use in clinical settings. It is much easier to measure the variables included in the data set used in this study, such as height, weight, and the circumferences of different body parts. While this study uses Pearson correlation coefficients to identify the strength and direction of the correlation between body fat percentage and the other attributes, Uçar et al. (2021) use Spearman correlation coefficients. Uçar et al. (2021) also use the body fat percentage obtained with the Siri method rather than the Brozek method used in this study. Consistent with this study’s correlation results, they find that abdomen circumference has a strong correlation with body fat percentage. In addition to decision tree regression, Uçar et al. (2021) use Multilayer Feedforward Neural Networks and a Support Vector Machine Regression Model to create a predictor model for body fat percentage. Uçar et al. (2021) thus identify the importance of abdomen circumference, as this study does; however, they do not consider the density variable essential to model development, whereas this study identifies density as important in determining body fat percentage.

In this study, multiple linear regression and a decision tree are used to create models that predict body fat percentage from measurements that are easy to obtain. As the study by Uçar et al. (2021) shows, machine learning methods can also be applied to easy-to-access measurements. A similar study by Chiong, Fan, Hu, and Chiong (2021) also uses a machine learning algorithm, the support vector machine, to create a model for the prediction of body fat percentage. Chiong et al. (2021) use a data set similar to the one used in this study and, like Uçar et al. (2021), do not include the density attribute in the model. Chiong et al. (2021) add a bias error control to the relative error support vector machine to increase the accuracy of the model in predicting body fat percentage. Compared to the decision tree model in this study, the relative error support vector machine provides a better model for predicting body fat percentage, given the high risk estimate noted in the decision tree results.

Like this study, Johnson (2021) used a multiple regression model to develop a method for estimating body fat percentage from a list of anthropometric variables. Johnson (2021) used a dataset of 184 instances collected from women aged between 18 and 25 years. The attributes in that data set include percent body fat, weight, height, age, and the circumferences of various parts of the body, including the neck, knee, chest, ankle, biceps, and elbow. Analysis of the data indicates that the age and height variables might not be useful in the development of the model, while the hips circumference and weight variables are useful in determining percent body fat. Johnson (2021) also identified the strength and direction of the relationship between the independent variables and percent body fat using Pearson correlation coefficients. The BMI and waist circumference variables are noted to have the strongest correlation with percent body fat. The multiple linear regression model defined in this study uses all the independent variables to predict percent body fat. Johnson (2021) notes that it is inconvenient for someone to take such a high number of measurements in order to predict body fat percentage. Therefore, based on the adjusted R squared values, Johnson (2021) uses three variables in the regression model: BMI, hips circumference, and waist circumference. The model created in this study has a standard error of 0.9042, whereas the model created by Johnson (2021) using only three variables has a higher standard error of 3.543%.

Accuracy Measure for Multiple Linear Regression

The multiple linear regression model generated to predict body fat percentage can be validated through two methods. The first uses the R-squared and adjusted R-squared values, and the second uses residual plots. In the first method, the model’s validity is judged by the proportion of the variation in the body fat percentage variable that is explained by the independent variables. An R-squared value of zero indicates that the model explains none of the variation, while a value of one indicates a perfect fit. The higher the R-squared, the better the prediction model; in this study, the R-squared value shown in image 4 is .987, which is high and indicates that the model is close to perfect prediction. The adjusted R-squared takes into consideration the degrees of freedom of the model and penalizes the incorporation of additional variables. The adjusted R-squared obtained in this study is .986, which likewise indicates a near-perfect prediction. A limitation of using the R-squared value to validate a multiple linear regression is that a high value can also result from a highly biased model, which would undermine the validity of the model. Therefore, residual plots are used to confirm the results obtained from the R-squared values.
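The adjustment for degrees of freedom can be checked with the standard adjusted R-squared formula. The sketch below assumes 252 cases and 16 independent variables, as implied by the reported F(16, 235); the function name is illustrative.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R squared for n cases and p independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)


# Reported R squared with the study's sample size and predictor count.
print(round(adjusted_r_squared(0.987, 252, 16), 3))

# The same R squared with far more predictors is penalized more heavily.
print(round(adjusted_r_squared(0.987, 252, 100), 3))
```

With the study's figures the formula reproduces the reported adjusted R-squared of .986. The second call uses a made-up predictor count to show the penalty growing as variables are added without any gain in R-squared.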

Residuals measure the error between the value predicted by the model and the actual value. A residual plot places the standardized predicted value on the x axis and the standardized residual on the y axis. The use of residual plots to assess the validity of the multiple linear regression model assumes that the errors between the actual and predicted values are independent and normally distributed. Every regression model contains some error between the actual and predicted values, since it is not possible to predict the actual values exactly. A residual plot that supports the validity of a regression model has the majority of its points located near the origin. The residual plot developed for this study to validate the model’s accuracy is shown in the image below.

Image 10: Residual Plot

The residual plot in image 10 above indicates that the regression model satisfies the assumptions of a good residual plot. The majority of the points are located near the origin, with only a few located away from it, which suggests that the residuals are approximately normally distributed. Drawing a distribution curve along the y axis supports this, since the peak of the curve would be at the center, where the majority of the points are located. The independence of the residuals can be seen in the lack of any pattern in how the residuals are distributed on the scatter plot. A pattern in the residuals would indicate an issue with the generated model, since one residual could then be used to predict another. While the R-squared value provides numeric confirmation of the validity of the multiple linear regression model generated in this study, the residual plot provides visual confirmation.

Accuracy Measure for Decision Tree

As with the prediction model generated through multiple linear regression analysis, the prediction model generated using decision tree analysis was also validated. Decision trees can be validated through two methods that help identify whether the resulting model can be generalized: cross validation and split-sample validation. To check that the model generated in this study can be applied to predict body fat percentage from another data set, both methods are applied. Cross validation divides the data set into multiple samples, or folds; in this study, 10 folds were used. For each fold, a tree is generated from the data in the other folds and then applied to the excluded fold to estimate the risk, and the risk estimates are combined to provide the cross validation risk estimate. The decision tree produced with cross validation is similar to the decision tree generated earlier: it has six nodes, like the decision tree in image 8. The risk estimate of the cross validation method is shown in the image below.

Image 11: Cross Validation Risk Estimate

As seen in image 11 above, cross validation yields a higher risk estimate than the resubstitution method. The higher risk estimate indicates that the generated model is likely to make larger errors when predicting cases it was not trained on.
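The 10-fold procedure can be sketched as follows. This is only an outline of how cases might be divided; the exact way SPSS assigns cases to folds is not stated in the text, so the shuffling, the seed parameter, and the function names are assumptions.

```python
import random


def k_fold_indices(n_cases, n_folds, seed=0):
    """Shuffle case indices and deal them into n_folds roughly equal folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def train_test_pairs(folds):
    """Yield (training indices, held-out indices), one pair per fold."""
    for i, held_out in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out


folds = k_fold_indices(252, 10)
print(sorted(len(fold) for fold in folds))

# A tree would be grown on each training set and its risk measured on the
# corresponding held-out fold; only the bookkeeping is shown here.
pairs = list(train_test_pairs(folds))
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

With 252 instances and 10 folds, each tree is grown on roughly 226 or 227 cases and evaluated on the remaining 25 or 26, so every case is used for evaluation exactly once.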

Split sample validation is also applied in this study to assess the accuracy of the generated model in predicting body fat percentage. Split sample validation involves defining a training sample and a test sample: a percentage of the data set is used to train the model, while the rest is used to test the generated model. In this study, the training sample is 95% of the data set, and the remaining 5% is used to test the model. The high training percentage is due to the size of the data set. With only 252 instances, a smaller training sample might not be large enough to produce a good model for use on the test data. Assigning 95% of the data to the training sample therefore aims to reduce the errors that would follow from a poorly fitted model. Because both a training sample and a test sample are used, two decision trees are generated. They are shown in the images below.

Image 12: Training Sample Decision Tree

Image 13: Test Sample Decision Tree
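To give a sense of the sample sizes involved in the 95% split, the sketch below divides 252 case indices into a training sample and a test sample. The random assignment, the seed parameter, and the function name are assumptions for illustration; the actual number of cases SPSS placed in each sample may differ slightly.

```python
import random


def split_sample(n_cases, training_fraction, seed=0):
    """Randomly assign case indices to a training and a test sample."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    n_training = round(n_cases * training_fraction)
    return indices[:n_training], indices[n_training:]


training, test = split_sample(252, 0.95)
print(len(training), len(test))
```

A test sample of only about a dozen cases means that its risk estimate rests on very few predictions, which is worth keeping in mind when comparing the training and test results.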

The decision trees generated using split sample validation differ from the decision tree generated without validation. One difference is that they have just 5 nodes, compared to the 6 nodes of the decision tree generated without validation and of the decision tree generated using cross validation. Because there are only 5 nodes, the density node for values between 1.06030 and 1.07280 is absent, and the node that previously covered values above 1.07280 now covers all values above 1.06030. Despite these differences, density and abdomen circumference remain the essential variables in all of the decision trees. The risk table for the decision tree built with split sample validation can be seen below.

Image 14: Split Sample Risk Estimate

The training sample shows a higher risk estimate than the test sample. The risk estimate for the training sample is also higher than that of the decision tree generated without any validation method.

Conclusion and Recommendations

One aspect of data mining that is not applied in this study is pruning, specifically for the generation of the decision tree model. Pruning involves eliminating some components of a generated model to improve the learning process of the model. For both the regression and decision tree analyses, pruning is omitted because of the small size of the data set used to generate the models. For example, the generated decision tree includes only 6 nodes with a depth of 2, so pruning would not have been useful in improving the model. Additionally, the CHAID method is applied in the generation of the decision tree, and pruning is only applicable when using the CRT and QUEST methods.

Comparing the two methods used to generate a predictor model for body fat percentage from various measurements, the multiple linear regression model is better than the decision tree model. Both analyses indicate that the density and abdomen circumference variables are the strongest predictors of body fat percentage. However, the regression model is better than the decision tree model since it also indicates the impact that the other variables have on an individual’s body fat percentage. Additionally, the risk estimate for the decision tree model indicates that its predictions of body fat percentage are prone to sizeable errors, whereas the multiple linear regression model is a good fit for the data set. The accuracy measures for the multiple linear regression model indicate that the model is valid, while the validation of the decision tree model indicates a high possibility of errors. This supports the argument that the model generated using multiple linear regression is better than the model generated using the decision tree method. Despite this conclusion, the use of a decision tree is recommended when analyzing a large data set.

From the results of the study, the density variable is noted to have a major impact on body fat percentage. The density variable and the adiposity index are among the derived variables present in the data set. A recommendation for future research is that such variables be excluded when creating a model to compute body fat percentage. The main reason for excluding derived data is that they are not easy to measure and must be calculated from measured data. The aim of this study was to create a model that is easy to use in various settings without the need to calculate additional variables. By contrast, the circumferences of the various body parts included in the data can be measured easily. Another recommendation is to add gender to the data set as one of the body fat percentage predictors. Various studies indicate that body fat percentage varies between men and women, so including gender as a variable can help determine its impact on the dependent variable. Alternatively, generating separate models for a data set of male respondents and a data set of female respondents can be useful. Finally, in the regression analysis, the variables identified as statistically insignificant were not eliminated. A recommendation is that such variables be eliminated in the future when generating multiple linear regression predictor models.

References

Chatterjee, S., Chatterjee, P., & Bandyopadhyay, A. (2006). Skinfold thickness, body fat percentage and body mass index in obese and non-obese Indian boys. Asia Pacific Journal of Clinical Nutrition, 15(2).

Chiong, R., Fan, Z., Hu, Z., & Chiong, F. (2021). Using an improved relative error support vector machine for body fat prediction. Computer Methods and Programs in Biomedicine, 198(10574), 9.

Guerra, R. S., Amaral, T. F., Marques, E., Mota, J., & Restivo, M. T. (2010). Accuracy of Siri and Brozek equations in the percent body fat estimation in older adults. The Journal of Nutrition, Health & Aging, 14(9), 744-748.

Johnson, R. W. (2021). Fitting Percentage of Body Fat to Simple Body Measurements: College Women. Journal of Statistics and Data Science Education, 1-13.

Liang, Y., Abbott, D., Howard, N., Lim, K., Ward, R., & Elgendi, M. (2019). How effective is pulse arrival time for evaluating blood pressure? Challenges and recommendations from a study using the MIMIC database. Journal of Clinical Medicine, 8(3), 337. doi:10.3390/jcm8030337

Tranmer, M., & Elliot, M. (2008). Multiple linear regression. The Cathie Marsh Centre for Census and Survey Research (CCSR), 5(5), 1-5.

Uçar, M. K., Uçar, Z., Uçar, K., Akman, M., & Bozkurt, M. R. (2021). Determination of body fat percentage by electrocardiography signal with gender based artificial intelligence. Biomedical Signal Processing and Control, 68, 102650.

Uçar, M. K., Ucar, Z., Köksal, F., & Daldal, N. (2021). Estimation of body fat percentage using hybrid machine learning algorithms. Measurement, 167, 108173.

Weiler, H. A., Janzen, L., Green, K., Grabowski, J., Seshia, M. M., & Yuen, K. C. (2000). Percent body fat and bone mass in healthy Canadian females 10 to 19 years of age. Bone, 27(2), 203-207.

Zhang, S., Zhang, C., & Yang, Q. (2003). Data preparation for data mining. Applied Artificial Intelligence, 17(5-6), 375-381.

Analysis of Body Fat Percentage