Abstract
Two main factors need to be considered when using fuel: ecology and economy. Ecologically, clean fuels (fuels that emit little or no CO2) are more efficient than those that are not clean. Economically, such clean fuels are costly compared to their counterparts. The Indian Human Development Survey (IHDS-II) 2011–12 data set provides usage details on six different types of fuel for over 42,000 households in India. This paper describes the requirements and the process followed to classify the data set based on the fuel usage variables. Machine learning techniques are applied to the data set to determine the factors that are responsible for the use of clean fuel over non-clean fuel in households.
1 Introduction
Fuel is an important constituent of people's livelihood. In the household sector, fuels are consumed for cooking, lighting, heating and other purposes. The major fuels consumed are crop residue, LPG, kerosene, electricity and other non-commercial fuels like firewood and dung cake. Unlike developed countries, where electricity is the dominant energy source, households in developing countries choose their fuel type based on availability and socio-economic conditions. For cooking in particular, it is estimated that around 30% of people rely on biomass [1].
The cost of consuming energy presents itself in different forms, mainly monetary and opportunity costs. The energy capacity of a fuel is defined by its calorific value, but the total potential of the fuel depends on two factors: economy and ecology. Economically, fuel efficiency is the ratio of the calorific value to the amount spent in acquiring that fuel. Ecologically, fuel efficiency depends on the direct and progressive effects the fuel has on the natural environment [2]. In terms of their effects on the environment, fuels fall into two main classes: clean and non-clean. This classification is based on the amount of impurities, mainly CO2, emitted per unit of fuel consumed. This paper takes six different fuels into consideration: firewood, dung cake, crop residue, kerosene, LPG and coal. According to sources, LPG [3] and kerosene [4] are classified as clean fuel and the others, coal [5], firewood [6], dung cake [7] and crop residue [8], as non-clean fuel.
This paper is organized as follows—Sect. 2 discusses the literature and theories that have been taken into consideration for the work. Section 3 provides a brief description of the IHDS-II data set. The methodology followed is described in Sect. 4 and the detailed results are presented in Sect. 5. Conclusions derived from the results are described in Sect. 6 along with approaches proposed for future research.
2 Literature Review
A handful of theories describe the nature of household fuel consumption. The most popular among them are the energy ladder model and the energy stacking model. Some studies affirm the effectiveness of the energy ladder [2], while others report that the energy stacking model explains fuel use better than the former [9, 10], especially in the context of developing countries.
The energy ladder model puts forth a linear correlation between a household’s income and the type of fuel it uses. Cleaner fuels, like LPG and kerosene, cost more compared to coal and other non-monetary sources of fuel like firewood and dung cake. This theory explains that households “switch” from using non-clean fuels to clean fuels as their income level increases, and vice versa, in order to meet their increasing requirements or decreasing savings [11]. The energy stacking model contests the energy ladder: it holds that households do not completely switch their fuel preference as income increases; instead, they use a combination of different fuel types for different purposes.
There are a number of studies that have worked in a similar area. A study in Africa [2] explores the households’ fuel choice within the context of the energy ladder hypothesis and tests for the effects of other factors such as asset ownership and the differences in fuel usage between urban and rural areas. A study in Arusha city, Tanzania [12], analyzed the socio-economic factors that are responsible for the urban households’ choice for cooking fuel and its toll on the households’ expenditure. The study revealed that the choice was influenced mainly by socio-economic and demographic factors like education, marital status, household size and occupation. A study in Pakistan [13] focuses on the effect of poverty on household fuel choice and its effects. The study reveals that the choice is affected not only by poverty but also by other factors like capital, asset ownership, etc. One such study in India [14] focuses on the Uttam Urja initiative that aims to promote cleaner energy options through development of value chains.
The concept of the energy ladder is closely connected with urbanization. The fuel or energy shifts are stimulated by an increase in monetary income mainly and also because of the availability of better and superior fuel sources locally. This shift in many cases becomes a status symbol also in the rural areas. From a purely economic point of view, the shift from a fuel of lower efficiency to more efficient fuel is good, apart from the wood saving which can be achieved if a major percentage of the population can shift away from fuelwood usage. This is based on the assumption that all the households are presented with a variety of choices of fuel types.
For a number of developing countries, this is very important from a policy standpoint. Thus, a question arises: how far are households free to choose between different energy sources? Are they actually able to exercise the choice between different fuels, or is the progress heavily dependent on economic status? Though studies in India have discussed methods to promote the use of cleaner energy in the future [14], no study has yet taken into account the actual rate and type of fuel consumption in India as a whole. Even if a smaller portion of the country were taken as a representative, the results would not necessarily hold for the entire nation. India, being a country of diversity not just in culture but also in its resources, will have a diversified usage rate.
As such, considering the country as a whole would be the better option to analyze the consumption pattern among the households. Many studies have been conducted using survey data as their information source. Some of these have taken an unconventional approach to solving the problem [15]: instead of relying on past literature alone, they have used a machine learning approach.
This study takes into account the differing nature of available resources and the nature of the settlements, along with the apparent factors contributing to the consumption pattern: income and expenditure. Taking into consideration the similarities and differences between the models, a machine learning approach is used to determine the factors that contribute to the change in fuel consumption pattern in India, instead of proceeding on the assumption that the factors are already known and constant.
3 Data Description
The study is done on the Indian Human Development Survey (IHDS-II) 2011–12 data [16]. It provides different sets of data for different categories like individuals, households, workers, doctors, villages, panchayats, etc. The survey includes 42,152 households in 1503 villages and 971 urban areas. Our focus is on the household data set of the survey, which has a total of 42,152 records with 758 different variables [17]. From the previous literature on the subject, it is evident that only some of the fields in the data set are relevant to the research. Manual pruning produced 12 fields relevant to the study. The fields URBAN2011, INCOMEPC, CO21, POOR and FM1 have been selected with reference to previous literature, and the other fields have been included in the study to offer a perspective that differs from the existing literature.
4 Methodology
The schematic diagram in Fig. 1 gives an overview of the process. The proposed approach is divided into three phases: phase I (data preparation), phase II (training and testing) and phase III (result analysis).
4.1 Phase I: Data Preparation
For the data to be used by the machine learning algorithm, it has to be free from any discrepancies or other errors. For cleaning, the record elimination method has been implemented, because the records containing missing values are negligible compared to the size of the whole data set (~0.032%). As mentioned earlier, the data consists of six different fields that contain fuel consumption data. However, for a smooth process, it is best if there is a single distinctive class label. As such, the first step is to derive a single unique class label from these six fields. Table 1 provides the description of the fields that are taken into consideration as predictors and Table 2 provides the description of the fields considered for the preparation of the class label.
For this process, two temporary fields, clean and non-clean, are created, which contain the amount of clean and non-clean fuel consumed by the households, respectively. This is done by changing the value of each fuel field to 1 or −1, representing whether or not the household has used that particular fuel. The sum of the fuel fields in each category (clean fuel and non-clean fuel) is then taken as the value of the corresponding temporary field. Since there are four non-clean fuels and two clean fuels, the fields are first scaled by dividing “non-clean” by four and “clean” by two to get a uniform representation.
Next, we combine these two fields into one single class label. This is done by comparing the numeric values of the fields and the new class label is formed. Table 3 provides a description of the categories in the new class label. For example, a household where the value of the clean variable is greater than that of the non-clean variable is categorized as Clean Favoured and so on for each household in the data.
Now, there is a distinct class label, named “USAGE”, with independent categories to work with. This way it is easier to analyze the households’ inclination towards one type of fuel than it would be to use two different variables and analyze the data using each of the new fields as a class label. The frequency of each class gives the number of households that follow that particular fuel consumption pattern. A higher value implies a higher number of households following the pattern and vice versa. Since the study is concerned with fuel usage statistics, and since the frequency of the class is very low compared to the others, the class Fuel not used has been eliminated from the data. As such, a total of 40797 records have been retained as the base for building the model, with five different categories present in the class label.
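The label derivation described in this phase can be sketched in Python as follows. This is a minimal illustration and not the code used in the study: the fuel keys, the function name `usage_label`, the category name "Only clean", and every rule other than "clean score greater than non-clean score means Clean favoured" are assumptions made for the example.

```python
# Hypothetical fuel keys; the real IHDS-II field names are not used here.
CLEAN = ("lpg", "kerosene")
NON_CLEAN = ("firewood", "dung_cake", "crop_residue", "coal")


def usage_label(used):
    """Derive a single USAGE category from six used/not-used flags.

    `used` maps each fuel name to True (fuel used) or False (not used).
    """
    def score(fuels):
        # Encode each fuel as +1 (used) or -1 (not used), sum the codes,
        # then divide by the number of fuels so both scores lie in [-1, 1].
        return sum(1 if used[f] else -1 for f in fuels) / len(fuels)

    clean = score(CLEAN)
    non_clean = score(NON_CLEAN)

    # Assumed rules: a score of -1 means no fuel of that kind is used.
    if clean == -1 and non_clean == -1:
        return "Fuel not used"
    if non_clean == -1:
        return "Only clean"
    if clean == -1:
        return "Only non-clean"
    if clean > non_clean:
        return "Clean favoured"
    if clean < non_clean:
        return "Non-clean favoured"
    return "Equally distributed"
```

For instance, a household that uses LPG, kerosene and firewood scores 1 on clean and −0.5 on non-clean, so under these assumed rules it is labelled Clean favoured.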
4.2 Phase II: Training and Testing
During this phase, the prepared data is tested to determine how much of it actually contributes to classifying the data based on the class label. This step is done in three stages: handling the class imbalance, feature selection and building the model.
Handling Class Imbalance
After the errors and inconsistencies are removed from the data, there may still be a problem with the distribution of its records over the different classes. In such cases, the class with the highest frequency is called the majority class and the one with the lowest frequency, the minority class. In these situations, most classifiers are biased towards the majority class and hence produce poorer results than when working on balanced data [18]. One technique for resolving this problem is resampling [19]. This method involves two processes: oversampling (where samples are taken into account more than once) and undersampling (where some samples are ignored). The other technique is the Synthetic Minority Oversampling Technique (SMOTE) [20], where new records are synthesized from existing records instead of creating duplicate copies of the present records. In cases where only one class has a different (high or low) spread compared to the others, one of these processes can be used. In the present scenario, however, where the difference in spread is not limited to a single class, a combination of the two techniques is used to obtain better results.
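The two ideas can be illustrated with a short Python sketch. It is not the Weka implementation used in this work: `smote_point` shows only the interpolation step of SMOTE (the neighbour is passed in rather than found by a nearest-neighbour search), and `resample_uniform`, its `target` parameter and the function names are hypothetical.

```python
import random


def smote_point(record, neighbour, rng):
    """Synthesize one numeric record on the segment between two minority records."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [a + gap * (b - a) for a, b in zip(record, neighbour)]


def resample_uniform(records_by_class, target, rng):
    """Bring every class to `target` records.

    Larger classes are undersampled (some records ignored); smaller classes
    are oversampled (some records drawn more than once).
    """
    balanced = {}
    for label, records in records_by_class.items():
        if len(records) >= target:
            balanced[label] = rng.sample(records, target)
        else:
            extra = rng.choices(records, k=target - len(records))
            balanced[label] = list(records) + extra
    return balanced
```

Unlike plain oversampling, the synthesized point is generally not a copy of either input record, which is the distinction drawn above between SMOTE and duplication.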
Feature Selection
The data contains fields that have been pruned manually based on previous literature. But not all of those fields will be statistically relevant in predicting the class label. This necessitates a selection process to identify which fields matter in predicting the class label.
The CFS [21] evaluation (correlation-based feature subset evaluation) is one such technique. It chooses the fields that have a high correlation with the class label and a low inter-correlation with each other. That is, it selects the features that provide enough information about the data to classify it based on the class label. Furthermore, the selected features themselves can still be ranked based on the amount of information they provide to classify the data. This helps to determine which features of the data are more suitable than others for determining the value of the class label.
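The trade-off between class correlation and inter-correlation is captured by the merit score from the CFS literature, merit = k·r_cf / sqrt(k + k(k − 1)·r_ff), where k is the number of features in the subset, r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. A minimal Python sketch follows; the function name and the way correlations are passed in are assumptions for illustration, and the correlation values themselves would have to be computed separately.

```python
import math


def cfs_merit(feature_class_corr, mean_feature_feature_corr):
    """Merit of a feature subset under correlation-based feature selection.

    feature_class_corr: one correlation with the class label per feature.
    mean_feature_feature_corr: average pairwise correlation among the features.
    """
    k = len(feature_class_corr)
    mean_fc = sum(feature_class_corr) / k
    return (k * mean_fc) / math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
```

Adding a second feature that is equally predictive but uncorrelated with the first raises the merit, while adding a fully redundant one leaves it unchanged, which is why the method prefers subsets with low inter-correlation.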
The cleaning process retained a total of 40830 records out of 42,152. From Table 2, it is evident that the classes are not evenly spread. After a trial and error process of using these techniques in conjunction, one method proved to be more suitable than the others. Here, SMOTE and the resampling technique have been used in conjunction: first the minority class is increased to 300% of its original size using SMOTE, then the resampling technique is applied over the resultant data with uniform class biasing, to even the spread of records through simultaneous oversampling and undersampling. The final distribution shows an equal spread of 9555 records per category. After this process, the data is subjected to feature selection using the cfsSubsetEval filter provided by Weka.
The infoGain or gainRatio filter has been used to determine the ranking of the selected features. These fields are the ones that are statistically relevant in classifying the data based on the class label, i.e. they are the fields that can actually provide the information necessary to classify a household based on the factors responsible for following a particular pattern. The features selected by the CFS subset evaluation technique, with their ranks as determined by the infoGain filter, are, in decreasing order of rank: INCOMEPC (1), STATEID, ID14, CO21, NPERSONS, URBAN2011, CGVEHICLE, ID18C, MG1 (9). A higher rank implies that the feature contributes more information to the classification and vice versa, with the highest rank corresponding to INCOMEPC and the lowest to MG1.
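The ranking criterion can be illustrated with a small Python sketch of information gain, i.e. the reduction in class entropy obtained by splitting the records on a feature. The sketch assumes categorical feature values (a numeric field such as INCOMEPC would have to be discretized first) and is not the Weka filter itself; the function names are chosen for the example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def info_gain(feature_values, labels):
    """Class entropy minus the weighted entropy after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

A feature whose values separate the classes perfectly attains the full class entropy as its gain, while a feature that is independent of the class has a gain of zero; ranking the selected features by this value yields an ordering of the kind reported above.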
Building the model
The data mostly consists of categorical fields with two or more categories in each field. Only two fields, namely INCOMEPC and CO21, consist of numeric data. As such, a decision tree model would better suit this classification problem as compared to other techniques, because there is no necessity for any numerical analysis. Hence, the random tree [22] classifier is used to build and test the model. Random tree is a straightforward generative classifier. It is not an iterative process. The tree is generated after the first read and hence has a very low building and testing time. While many machine learning algorithms tend to iterate over the initial process to obtain an optimal solution, the random tree classifier works towards generating a tree whose attributes are based solely on the data. The training and testing of the model have been done using a machine learning tool called Weka that contains in-built filters to handle class imbalance, feature selection and model building.
4.3 Phase III: Result Analysis
The overall efficiency of the model is determined based on its accuracy (the fraction of records that have been correctly classified by the model). Higher accuracy implies a higher probability for each of the records to be correctly identified. For example, a model with an accuracy of 1 (100%) will correctly classify each and every record in the data. The individual results for each class are analyzed based on the precision (the fraction of records assigned to the class that actually belong to it), recall (the fraction of records belonging to the class that have been retrieved) and f-measure (the harmonic mean of precision and recall) [23].
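These measures can be computed from the actual and predicted labels as in the following Python sketch. The function names are chosen for illustration, and the f-measure is taken as the harmonic mean of precision and recall (the usual F1 definition).

```python
def accuracy(actual, predicted):
    """Fraction of records whose predicted label matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def class_metrics(actual, predicted, label):
    """Return (precision, recall, f_measure) for one class label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)  # true positives
    fp = sum(a != label and p == label for a, p in pairs)  # false positives
    fn = sum(a == label and p != label for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Because the f-measure is a harmonic mean, it is pulled towards the smaller of precision and recall, so a class scores well only when both are high.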
5 Experimental Results
This section provides the results obtained by applying the random tree model to the data set. The data preparation phase has been handled with the help of the R interactive console. The model achieved an accuracy of 82.8864%, i.e. about 83 records out of 100 would be correctly classified by the model. Figure 2 shows the distribution of the precision, recall and f-measure for each category in the class label along with their average. As seen in the figure, the recall (the ability to identify a record correctly) varies across the categories: some are above the mean (only non-clean, 0.906; non-clean favoured, 0.89) and some are well below it (clean favoured, 0.695).
This implies that the records in the categories with higher recall were more easily identified by the classifier than those in the categories with lower recall. This is due to the relevance that each of the fields holds with respect to the class label in that particular category, i.e. the easily classified categories are better explained by their predictors than the other categories. In general, differences in recall can arise from a number of factors, such as incorrect data, inconsistent data format or insufficient correlation between the fields and the class label. Here, the variation can be attributed directly to the varying relevance of the fields, as there are no data discrepancies that could contribute to it.
For example, the class with the highest f-measure of 0.901 (only non-clean) is better explained by the information provided by the selected features than a class with a lower f-measure, say clean favoured, which has a value of 0.726. Though this value is not low enough to be discredited, it is lower than that of the other classes. Based on the results, it is easier to identify households that use only non-clean fuel than households in the other categories. At the same time, it is harder to identify with high precision whether a particular household favours clean fuel over non-clean fuel. This implies that the relation between the fields and the class label is not constant but depends on the class itself, i.e. for a category with high recall, the relationship between the fields and the class label is stronger than for the ones with lower recall. The f-measure, being directly dependent on recall and precision, rises and falls with them.
6 Conclusion and Future Work
The data that has been used is highly reliable (96.78% of the records were retained after the cleaning and reduction processes), but it is imbalanced. From the values in Table 2, it can be asserted that fuel usage in India is not equally distributed among the different categories.
The values show that the number of people who prefer clean fuel (LPG and kerosene) is greater than the number of people who prefer equally distributed usage, which is in turn greater than the number of people who prefer non-clean fuel (firewood, dung cake, crop residue and coal). The features selected by the CFS algorithm comprise a total of nine different fields that are statistically relevant in classifying the data. This shows that, despite the popular belief that the fuel consumption pattern is determined by the economic status of the households, there are other factors that contribute to the nature of fuel consumption. These factors are described by the respective fields selected during the subset selection process. They include fields that describe the socio-economic status of the households, like the settlement nature of the household (URBAN2011) and the expenditure on fuel consumption (CO21), along with other factors such as the region of settlement (STATEID), the number of residents (NPERSONS), etc.
This analysis extends the existing theories, which hold that economic factors are responsible for determining the fuel consumption nature of households. The findings of this work suggest that economic status is not the only factor that affects the nature of fuel consumption but one among many. Although this work provides an extended view of the consumption nature of households in developing countries, it is based on the assumption that the households taken into consideration represent all the households of the same country or state. As such, further work may include data not only from a single country but from multiple countries, or from the same households across multiple time periods, to monitor the changes in their fuel consumption pattern over time. Further work could also take into consideration the individual fuel rates in a particular region, along with availability, taxation, policy, laws and other factors that directly or indirectly affect the nature of fuel consumption, and observe their effects on the problem.
References
International Energy Agency. 2006. World Energy Outlook.
Toole, R. 2015. The Energy Ladder: A Valid Model for Household Fuel Transitions in Sub-Saharan Africa?
A. F. D. Center. Propane Fuel Basics. U.S. Department of Energy.
Lam, N.L., K.R. Smith, A. Gauthier, and M.N. Bates, Kerosene: A Review of Household Uses and Their Hazards in Low- and Middle-Income Countries. Journal of Toxicology and Environmental Health Part B.
Coal, The fuel of the future, unfortunately. The Economist, April 16, 2014.
Ecofriendly Alternatives to Burning Wood in your Fireplace. Scientific American.
Pant, K.P. 2010. Health Costs of Dung-Cake Fuel Use by the Poor in Rural Nepal. Kathmandu: South Asia Network of Economic Research Institutes.
Crop Residue, Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Crop_residue.
Mekkonen, A., and G. Kohlin. 2009. Determinants of Household Fuel Choice in Major Cities in Ethiopia. Working Papers in Economics.
Ado, A., I.R. Darazo. 2016. M. A. 7 (3).
Heltberg, R. 2003. Household Fuel and Energy Use in Developing Countries: A Multicountry Study. The World Bank: Washington, DC.
Thadeo, S.M. 2014. Economics of Urban Households’ Cooking Fuel Consumption in Arusha City, Tanzania. Morogoro, Tanzania: Sokoine University of Agriculture.
Nasir, Z.A., F. Murtaza, and I. Colbeck. 2015. Role of Poverty in Fuel Choice and Exposure to Indoor Air Pollution in Pakistan. Journal of Integrative Environmental Sciences 107–117.
Rehman, I.H., A. Kar, R. Raven, D. Singh, J. Tiwari, R. Jha, P.K. Sinha, and A. Mirza. 2010. Rural Energy Transition in Developing Countries: A Case of the Uttam Urja Initiative in India. Environmental Science & Policy 13 (4): 303–311.
Dominic, V., D. Gupta, S. Khare, and A. Agrawal. 2015. Investigation of Chronic Disease Correlation Using Data Mining Techniques. In 2nd International Conference on Recent Advances in Engineering and Computational Sciences, Chandigarh.
Narendranath, S., S. Khare, D. Gupta, and A. Jyotishi. 2018. Characteristics of ‘Escaping’ and ‘Falling into’ Poverty in India: An Analysis of IHDS Panel Data Using Machine Learning Approach. In 7th International Conference on Advances in Computing, Communications and Informatics (ICACCI), Sept 2018. IEEE.
Indian Human Development Survey. [Online]. Available: https://ihds.umd.edu/.
Longadge, R., S.S. Dongre, and L. Malik. 2013. Class Imbalance Problem in Data Mining: Review. International Journal of Computer Science and Network.
Brownlee, J. 2018. A Gentle Introduction to Statistical Sampling and Resampling. Machine Learning Mastery. [Online]. Available: https://machinelearningmastery.com/statistical-sampling-and-resampling/.
Chawla, N.V., K.W. Bowyer, L.O. Hall, and P.W. Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research.
Hall, M.A. 1999. Correlation-based Feature Selection for Machine Learning. New Zealand: University of Waikato.
Random Forest. [Online]. Available: https://en.wikipedia.org/wiki/Random_forest#Algorithm.
Joshi, R. 2016. Accuracy, Precision, Recall & F1 Score: Interpretation of Performance Measures. Exsilio Solutions.
© 2020 Springer Nature Singapore Pte Ltd.
Cite this paper
Shyam Sundar, K., Khare, S., Gupta, D., Jyotishi, A. (2020). Analysis of Fuel Consumption Characteristics: Insights from the Indian Human Development Survey Using Machine Learning Techniques. In: Raju, K., Govardhan, A., Rani, B., Sridevi, R., Murty, M. (eds) Proceedings of the Third International Conference on Computational Intelligence and Informatics . Advances in Intelligent Systems and Computing, vol 1090. Springer, Singapore. https://doi.org/10.1007/978-981-15-1480-7_30
DOI: https://doi.org/10.1007/978-981-15-1480-7_30
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1479-1
Online ISBN: 978-981-15-1480-7
eBook Packages: Intelligent Technologies and Robotics (R0)