An overview of Data Mining and its Applications to Environmental Problems

Amolak Singh
Apr 8, 2021

Note: In 2016, I wrote this report for an independent study course I took at the University of Minnesota. It discusses environmental applications of data mining. Below, I have left the content of the original essay as it was originally written with the exception of some small formatting modifications.

Data mining is a complex process by which users of data attempt, in an automated fashion, to discern systematic patterns and relationships between variables in large data sets in order to gain some sort of meaningful and actionable insight.

The applications of data mining are far reaching and include two important, yet distinct, domains: scientific research and business analysis. While data mining in business and finance has developed quite well and found a lot of success, it has struggled in scientific fields such as climate science (Faghmous and Kumar, 2014).

Due to the development of new, always improving technology, the amount of available environmental data is unmanageable, necessitating innovative data mining techniques and algorithms (Faghmous and Kumar, 2014). Additionally, environmental data can be tricky to deal with due to the two main versions of data used in research today: 1) recently collected, complex, and multidimensional data, and 2) older, less complex data that may have significant gaps (Noyes, 2016).

To understand the effects of climate change, researchers, governmental agencies, NGOs, and private corporations are working on research to develop innovative big data techniques which can glean meaningful and actionable information (Gurin, 2015).

Rather than focusing on the specific details of data mining algorithms, this report will discuss the importance of using big data to understand climate change and the difficulties of analyzing complex environmental data, explore in detail two specific applications of innovative data mining techniques to environmental questions, and examine the implications of these new insights for environmental data science.

Why Should We Use Big Data to Understand Climate Change?

There is no doubt that climate change is occurring and that its effects have the potential to be unpleasant, and even devastating. The warming of the earth’s climate has brought about many adverse effects, not only for natural ecosystems but also for humanity and society. Increasing temperatures have brought rising sea levels, changing weather patterns, and an increase in the severity and number of severe weather events that annually cause billions of dollars of infrastructure damage in the United States alone; these concerns have brought climate change to the forefront of international politics and diplomacy (Guo et al, 2015).

For example, rising sea levels are a great cause for concern due to the immediate effect they have on people and property in coastal communities across the planet. Data visualization of the rising sea levels based on accurate predictions can spur the relevant governments to enact relevant policy (Lehmann, 2015).

Another poignant example is water stress, which, by 2025, is expected to impact at least five billion people, potentially forcing water to become the prized resource of the future just as oil is currently (Arnell, 1999). In addition, climate change has other not so obvious deadly effects, such as exacerbating the spread of communicable diseases to unexpected locations whose residents are not immune to the new diseases (Vora et al, 2012).

Environmental problems are global and require assistance from statistics, mathematics, and computer science for modeling and understanding significant changes in earth’s ecosystem (Das and Srinivasan, 2009). Research using climate change data has three major purposes: to increase international engagement and collaboration, to understand and allow for adaptation to climate change, and to mitigate and/or delay the predicted effects of climate change (Gurin, 2015).

In addition, learning about climate change through non-traditional methods has the potential to “reveal large-scale processes and features that are not observable via traditional methods” (Guo et al, 2015). However, plugging data into an algorithm is not enough; proper interpretation of the results by climate scientists and thoughtful policy decisions based on those interpretations are key to delaying and even mitigating the effects of climate change.

Challenges of Climate Data

In order to understand, analyze, and interpret data, it must first be processed. This is one of the fundamental challenges which climate big data research is attempting to address. With the massive amounts of climate data being collected from different sources, such as satellite sensors or in situ observations, doubling every two years, and with the data becoming increasingly complex and multidimensional, data scientists can no longer rely on traditional methods of data analysis and must develop advanced forms of data mining, both supervised and unsupervised (Han et al, 2002). Guo et al label climate data as having what they call “4V features (volume, variety, veracity, and velocity)” (2015).

Computational techniques such as parallel computing have been able to break down incredibly large problems into smaller, more manageable ones thus dealing with the large volume of data (Drake, 2016). The second issue is the lack of homogeneity in climate data. More often than not modern climate data “integrates the work of scientists in meteorology, oceanography, hydrology, geology, biology and other established fields,” which ensures the rarity of data sets that focus on long-term trends; environmental data often “comes from many different sources, with different formats, resolutions and qualities” (Kobielus, 2014; Spate et al, 2006).

Even when data is relatively unified, the dimensionality of the data must be taken into consideration. Most, if not all, environmental processes occur in at least two or three spatial dimensions, in addition to having a time component; within this framework, “multiple factors are acting at many different spatial and temporal scales,” often in an unstructured and non-linear way (Spate et al, 2006). Finally, environmental data is prone to error or simply limited in at least one dimension. For example, many data sets may only go back a couple of decades, may contain many missing values, or may not fully capture and thus give an incomplete view of the climate system’s behavior (Faghmous and Kumar, 2014; Spate et al, 2006).

The next two sections of this report will look at two specific ways, which I found particularly fascinating, in which researchers attempted to address these challenges of environmental data and the insights that resulted from their work.

Analyzing Climate Data Using Complex Networks (Steinhaeuser et al, 2015)

In the past, climate scientists have used various clustering techniques in order to gain insight into regions with similar climatic patterns; however, these techniques are limited because they can only do one of the following two things:

  1. They consider only a single time period of multivariate data, or
  2. They consider only a single variable by using the time series data as a multidimensional vector (Steinhauser et al, 2015).

To address these shortcomings, Steinhauser et al implemented graph-based algorithms to understand multivariate climate data rather than the traditionally used clustering or regression methods, which can lose important information.

In this model, rather than clustering based on univariate similarity or spatial proximity, Steinhauser et al use a climate network model in which physical locations are represented by nodes, and “a cross correlation-based measure of similarity [is used] to create weighted edges” (2015).

The relationship among multiple climate variables determines edge placement, which is not subject to spatial constraints, and a detection algorithm discovers clusters which correspond to climate regions; the development of this network allowed researchers to track “communities,” which are groups of related, yet physically separate locations, thereby capturing complex relationships and patterns that span both space and time (Steinhauser et al, 2015).

This graph-based technique implemented by Steinhauser et al includes three major steps:

  1. Defining a cross correlation-based measure of similarity between different locations.
  2. Constructing the climate network by creating and weighting edges.
  3. Eliminating extraneous edges to discover and track the relevant networks, i.e. the communities.

To construct a climate network, Steinhauser et al had to create a measure of similarity. However, they could not use traditional measures such as Euclidean and Mahalanobis distances because those measures are ineffective “for high-dimensional and noisy data, or when clusters of varying density are known to exist within the data” (Steinhauser et al, 2015).

To avoid these issues, Steinhauser et al proposed to define similarity based on the quantity of nearest neighbors two points share. To do this, they took time series for four climatic variables (air temperature, pressure, relative humidity, and precipitable water) and, using a cross correlation function, compared the time series of different nodes and calculated the correlation values between the four variables, which ranged from -1 to 1 (-1 meaning a perfectly inverse relationship, 0 meaning no relationship, and 1 meaning a perfectly direct relationship).

This process resulted in a measure of similarity between two grid cells which was the Euclidean distance in six dimensions, one for each of the six possible pairings of the four variables; thus, the interaction between variables at each location, rather than the behavior of a single variable, defined the strength of the relationship between different regions (Steinhauser et al, 2015).
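
One way to read the similarity measure described above is sketched below in Python. It is an illustration under stated assumptions, not the authors' implementation: it uses a plain lag-zero Pearson correlation in place of a full cross correlation function, the helper names (`pearson`, `signature`, `distance`) are hypothetical, and the three toy locations are made-up numbers.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (ranges from -1 to 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def signature(series_by_variable):
    """Six correlations, one per unordered pair of the four variables."""
    names = sorted(series_by_variable)
    return [pearson(series_by_variable[a], series_by_variable[b])
            for a, b in combinations(names, 2)]


def distance(sig_a, sig_b):
    """Euclidean distance between two six-dimensional signatures."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(sig_a, sig_b)))


# Toy data (assumed): three locations, four variables, six time steps each.
loc_a = {
    "temperature": [1, 2, 3, 4, 5, 6],
    "pressure": [6, 5, 4, 3, 2, 1],
    "humidity": [2, 1, 4, 3, 6, 5],
    "precipitable_water": [1, 3, 2, 5, 4, 6],
}
# Same relationships between variables as loc_a, at a different scale.
loc_b = {name: [2 * v + 10 for v in values] for name, values in loc_a.items()}
# Pressure now rises with temperature instead of falling.
loc_c = dict(loc_a, pressure=[1, 2, 3, 4, 5, 6])

sig_a, sig_b, sig_c = signature(loc_a), signature(loc_b), signature(loc_c)
print(distance(sig_a, sig_b))  # close to zero: the variables interact alike
print(distance(sig_a, sig_c))  # clearly larger: the interactions differ
```

Two locations end up close together when their variables interact in the same way, even if the raw values differ, which matches the idea that the interaction between variables defines the relationship.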

Once Steinhauser et al developed the above similarity measure, they needed to cluster the grid locations into regions based on climatic similarity. Instead of applying k-means or a related clustering method, which would not resolve the issues mentioned earlier, the researchers had to construct the networks by “dividing the time series into 5-year windows, so that twelve separate networks would be available to study changes in structure over time,” and then by letting each grid cell be a node in the network, with the edge weighted by the correlation calculated above, resulting “in a fully connected network…consisting of over ten thousand nodes and more than 55 million edges” (Steinhauser et al, 2015).
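
The window and edge counts quoted above can be checked with a few lines of Python. This is only a sketch: the monthly resolution, the 60-year span, and the node count of 10,512 (a 144 × 73 grid) are assumptions chosen to be consistent with "twelve separate networks," "over ten thousand nodes," and "more than 55 million edges"; the helper names are hypothetical.

```python
def split_into_windows(series, window_length):
    """Split a series into consecutive, non-overlapping windows.

    Any leftover values that do not fill a whole window are dropped.
    """
    count = len(series) // window_length
    return [series[i * window_length:(i + 1) * window_length]
            for i in range(count)]


def complete_graph_edges(node_count):
    """Number of edges when every pair of nodes is connected: n(n-1)/2."""
    return node_count * (node_count - 1) // 2


# Assumed: 60 years of monthly values at one grid cell.
monthly_series = list(range(60 * 12))
windows = split_into_windows(monthly_series, window_length=5 * 12)
print(len(windows))                  # 12

# Assumed node count for a 144 x 73 latitude-longitude grid.
print(complete_graph_edges(10_512))  # 55245816
```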

To define the fundamental structure of the network, Steinhauser et al removed all of the edges except for the strongest 1% of correlations, leaving “over a half million edges intact…[and] twelve climate networks for analysis” (Steinhauser et al, 2015). To find communities within the twelve climate networks, Steinhauser et al chose to use an algorithm called “WalkTrap” because it was computationally efficient and was able to consider weighted networks based on the idea that “random walks in a network are more likely to remain within the same community than to cross community boundaries.”
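
The pruning step can be sketched as follows. One percent of roughly 55 million edges is roughly 550,000, which agrees with the "over a half million edges" quoted above. The code is a minimal illustration, assuming that a larger weight means a stronger relationship; the function name `keep_strongest` and the toy weights are hypothetical, and the WalkTrap community detection itself is not reproduced here.

```python
def keep_strongest(edges, fraction=0.01):
    """Keep only the top `fraction` of edges, ranked by weight.

    `edges` is a list of (node_u, node_v, weight) tuples. At least one
    edge is always kept so that tiny examples do not end up empty.
    """
    keep = max(1, int(len(edges) * fraction))
    return sorted(edges, key=lambda edge: edge[2], reverse=True)[:keep]


# Toy fully connected network on 21 nodes (210 edges) with made-up weights.
nodes = range(21)
edges = [(u, v, ((u * 31 + v * 17) % 101) / 100)
         for u in nodes for v in nodes if u < v]

pruned = keep_strongest(edges)
print(len(edges), len(pruned))  # 210 2
```

A community detection algorithm would then run on the pruned network rather than on the fully connected one.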

The results of these methods were fascinating. As a result of developing the climate networks, Steinhauser et al found communities that displayed not only direct relationships (i.e. the locations were very similar), but also communities that displayed inverse relationships where the correlation was close to -1.

For example, one of the communities discovered was formed by three regions: Southeast Asia, South America, and Africa. These regions either belong to the “Tropical Wet-Dry (South America, Africa) or to the Monsoon (South-East Asia) climate zones,” and this community was probably formed by “the strong inverse relationship (negative correlation) between the hydrological patterns in the two regions– extremely dry conditions versus intense rainfall/monsoons during the summer months” (Steinhauser et al, 2015).

This approach to developing clusters of similar climatic regions is useful for two reasons: it preserves patterns in the climate data that span both space and time, and it finds otherwise not so obvious relationships between different locations across earth.

It would be interesting to do this process again, except by choosing variables other than air temperature, pressure, relative humidity, and precipitable water to see how the results might change, and the new insight that might be gained. Finding regions of climatic similarity, and therefore a similar interest in preserving their environment and natural resources, would go a long way to fight the effects of climate change while giving a reason for countries to unify.

Anomaly Detection of Global Climate System (Das and Srinivasan, 2009)

Das and Srinivasan took a different data mining approach, and attempted to detect spatial, temporal and spatiotemporal abnormalities in the global climate system. Their idea was to find outliers from data that consisted of fifty years of daily global air temperature and precipitation measurements (1950–1999) that came from a variety of sources including satellite sensors, airborne cameras and weather radars, which “accrued terabytes of temporal, spatial and spatio-temporal data.” This resulted in challenges such as a high amount of computational complexity due to the astronomical amount of data, algorithmic complexity due to spatial dependency, and complexity due to non-linear correlations between processes (Das and Srinivasan, 2009).

First, it is important that the types of outliers are defined. Das and Srinivasan define spatial outliers to be geographical areas whose climate measurements are significantly different from those of other objects in their spatial neighborhood, while temporal outliers are time periods which show changes in the historical trends.

On the other hand, a spatio-temporal outlier is one whose climate behavior is significantly different, both spatially and temporally, from referenced objects in its neighborhood (Das and Srinivasan, 2009). The goal of their research is to find days and locations where there is a significant difference of temperature and precipitation measurements when compared to other similar locations and days.

To analyze the fifty years of daily global air temperature and precipitation measurements, Das and Srinivasan first had to reduce the dimensionality of the dataset, which allowed their outlier detection algorithm to work efficiently on the reduced space.

Starting with an 18262 × 18048 matrix, Das and Srinivasan reduced it to an 18262 × 2 matrix by taking the L2 (Euclidean) norm of each day’s temperature and precipitation measures. Each day was a row (there are 18,262 days between 1950 and 1999) and the columns stood for the two variables: air temperature and precipitation.
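
The reduction can be sketched in Python as below. This is one interpretation of the description above, assuming each day's values across all locations are collapsed into a single L2 norm per variable; the function names and the tiny three-day, two-location data set are hypothetical.

```python
from math import sqrt


def l2_norm(values):
    """Euclidean (L2) norm of a list of numbers."""
    return sqrt(sum(v * v for v in values))


def reduce_days(temperature, precipitation):
    """Collapse each day's per-location values into two numbers.

    Both inputs hold one list per day, with one value per location.
    The result has one [temperature_norm, precipitation_norm] row per day.
    """
    return [[l2_norm(t), l2_norm(p)]
            for t, p in zip(temperature, precipitation)]


# Toy data (assumed): 3 days, 2 locations.
temperature = [[3.0, 4.0], [6.0, 8.0], [0.0, 5.0]]
precipitation = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

reduced = reduce_days(temperature, precipitation)
print(reduced)  # [[5.0, 0.0], [10.0, 1.0], [5.0, 2.0]]
```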

For detecting the more complex spatio-temporal outliers, they selected “the sub portion of interest from each of the original data [in the] matrix such that consecutive locations constituting the local spatial neighborhood are along rows and the temporal neighborhood is along column,” after which the matrix was converted to an n×m spatial grid with latitude as the rows and longitude as the columns and the cells containing the temperature or precipitation measurement at the relevant location (Das and Srinivasan, 2009).

Since the dataset was not labeled, Das and Srinivasan used an unsupervised anomaly detection method which determines outliers based on nonparametric measures such as distance and density. They argue that because outliers are events with very low probabilities, their exact probability is difficult to compute, which calls for data mining techniques based on the concept of nearest neighbors; the anomalous data values are those at a greater distance from their nearest neighbors. Das and Srinivasan chose “a near-linear time distance-based outlier detection algorithm” (2009).
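
The general idea of distance-based outlier detection can be sketched as follows. This is a naive, quadratic-time illustration and not the near-linear algorithm the authors chose: each point is scored by the distance to its k-th nearest neighbor, and the highest-scoring points are reported. The function names, the choice of k, and the toy points are assumptions.

```python
from math import dist  # available in Python 3.8 and later


def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


def top_outliers(points, k, n):
    """Indices of the n points with the largest k-th neighbor distance."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Toy data (assumed): four clustered points and one far from the rest.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
print(top_outliers(points, k=2, n=1))  # [4]
```

No labels or probability model are needed, which is why this family of methods suits an unlabeled data set.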

Traditional data mining algorithms have difficulty mining spatial datasets since spatial objects are affected by spatial processes in the neighborhood; for example, climate behavior at a given location is often influenced by climate conditions nearby (Das and Srinivasan, 2009). Instead, spatial dependency along with the consideration of a local temporal neighborhood helps to identify spatiotemporal outliers.

Using this approach, Das and Srinivasan computed a similarity score for each cell of the data matrix by considering its “r-point neighborhood.” If the score of a cell is significantly different from the other scores in the grid, the cell is exhibiting spatially anomalous behavior; if the scores of some cells in the spatial grid are significantly different for a couple of consecutive timestamps, they are exhibiting temporally anomalous behavior; and a spatially anomalous unit is deemed to be a spatio-temporal outlier if it also deviates from historical trends (Das and Srinivasan, 2009).
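
A neighborhood-based score of this kind might look like the sketch below. The exact definition of the "r-point neighborhood" and of the score is not given here, so the code makes two loud assumptions: the neighborhood is a square window of radius r around the cell, and the score is the absolute difference between the cell's value and the mean of its neighbors. Longitude wrap-around is ignored, and the grid values are made up.

```python
def neighborhood_scores(grid, r=1):
    """Score each cell by how far it sits from the mean of its neighbors.

    Assumed neighborhood: every other cell within r rows and r columns.
    The grid needs at least two cells so that each cell has a neighbor.
    """
    rows, cols = len(grid), len(grid[0])
    scores = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            neighbors = [grid[a][b]
                         for a in range(max(0, i - r), min(rows, i + r + 1))
                         for b in range(max(0, j - r), min(cols, j + r + 1))
                         if (a, b) != (i, j)]
            scores[i][j] = abs(grid[i][j] - sum(neighbors) / len(neighbors))
    return scores


# Toy 3 x 3 temperature grid (assumed) with one unusually warm cell.
grid = [
    [10.0, 10.0, 10.0],
    [10.0, 20.0, 10.0],
    [10.0, 10.0, 10.0],
]
scores = neighborhood_scores(grid)
print(scores[1][1])  # 10.0
```

Repeating the computation for each timestamp would give a sequence of score grids, from which cells that stay anomalous over consecutive timestamps could be picked out.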

The results of Das and Srinivasan are interesting. For example, they found in their analysis of the temporal outlier that three time periods had a significant/sudden change, and therefore were labeled as anomalous: 1950–1952, 1980–1984, and 1996–1998 (Das and Srinivasan, 2009). While Das and Srinivasan argue that their results are “reasonably aligned with the anomalies cached in the historical records of global climate data,” it seems odd that they do not pick up on the global increase in temperature that accelerated beginning in the 1970s.

There are two possible reasons:

  1. Their bar for the definition of anomaly was set too high, thereby missing other anomalous time periods generally associated with climate change, or
  2. The increase in global temperature was steady, so the algorithm did not pick up the changes as anomalous.

Interestingly, most of the years which were labeled as anomalous coincided with La Niña or El Niño years.

In terms of spatial outliers, the areas with the most data points labeled as outliers were the region enclosed by the equatorial South Pacific Ocean and parts of South America, the area enclosed by parts of Alaska and Canada, and Antarctica; these are also the locations impacted the most by La Niña or El Niño events (Das and Srinivasan, 2009).

In addition, these seem to be areas which are also impacted by overall trends in climate change: lengthening summers in North America and more ice melt in Antarctica seem to be reflected in Das and Srinivasan’s results. Das and Srinivasan discovered an overlap between the spatio-temporal outliers and the spatial/temporal anomalies; however, they also found that the spatio-temporal outliers captured anomalies that were otherwise hard to pick up (2009).

Concluding Statements

Researchers have applied a variety of data mining and big data techniques to discover patterns in climate data and understand climate change better. This research is critical for understanding climate change as it is becoming one of the most important issues and possibly the most difficult challenge of the next century.

Throughout my research, I looked through dozens of journal articles that applied various supervised and unsupervised learning techniques to climate data. This report looks in detail at only two approaches, which I found interesting: one which used clustering via complex networks to determine regions of similarity, and another which used a nearest-neighbor method to discover spatial, temporal and spatio-temporal anomalies.

Steinhauser et al used a graph-based approach to represent climate data, which they argued was “a more intuitive model that can be used for analysis while also having a direct mapping to the physical world.” They then used a cross-correlation function to build the network, which allowed them to detect and track regions of similarity or dissimilarity, i.e. “communities.” They showed that communities have a climatological interpretation and that “disturbances in structure can be an indicator of climate events (or lack thereof)” (Steinhauser et al, 2015).

Das and Srinivasan used a nearest neighbor data mining algorithm to discover anomalies among the spatial, temporal and spatio-temporal earth science data. After reducing the data dimensionality, they detected outliers which both show abrupt changes in the global climate over the years and explain extreme events such as drought and severe rainfall at certain locations.

Arnell, Nigel W. 1999. Climate change and global water resources. In Global Environmental Change. Vol. 9. Elsevier Ltd.

Das, Mahashweta, and Srinivasan Parthasarathy. 2009. Anomaly detection and spatio-temporal analysis of global climate system. Proceedings of the Third International Workshop on Knowledge Discovery from Sensor Data: 142–150.

Drake, John B. “Predicting Climate Change.” Oak Ridge National Laboratory. Available online at http://web.ornl.gov/. Last accessed September 1, 2016.

Faghmous, James H, and Vipin Kumar. 2014. A Big Data Guide to Understanding Climate Change: The Case for Theory-Guided Data Science. Big Data 2, no. 3: 155–163.

Guo, Hua-Dong, Li Zhang, and Lan-Wei Zhu. “Earth observation big data for climate change research.” Advances in Climate Change Research 6, no. 2 (2015): 108–117.

Gurin, Joel. “Fighting Climate Change: The Ultimate Data Challenge.” Huffington Post, December 23, 2015. Available online at http://www.huffingtonpost.com/. Last accessed September 1, 2016.

Han, Jiawei, Russ B. Altman, Vipin Kumar, Heikki Mannila, and Daryl Pregibon. “Emerging scientific applications in data mining.” Communications of the ACM 45, no. 8 (2002): 54–58.

Hansen, M.C., P.V. Potapov, R. Moore, M. Hancher, S.A. Turubanova, A. Tyukavina, D. Thau, et al. 2013. High-resolution global maps of 21st-century forest cover change. Science (New York, N.Y.) 342, no. 2013: 850–3.

Kobielus, James. “Data Science’s Limitations in Addressing Global Warming.” IBM Big Data & Analytics Hub, September 25, 2014. Available online at http://www.ibmbigdatahub.com/blog/. Last accessed September 1, 2016.

Lehmann, Evan. “Can Big Data Help US Cities Adapt to Climate Change?” Scientific American, March 20, 2014. Available online at http://www.scientificamerican.com/. Last accessed September 1, 2016.

Noyes, Catherine. “Big Data’s Biggest Challenge: Climate Change.” Fortune, June 23, 2014. Available online at http://fortune.com/. Last accessed September 1, 2016.

Spate, Jessica, Karina Gibert, Miquel Sànchez-Marrè, Eibe Frank, Joaquim Comas, Ioannis Athanasiadis, and Rebecca Letcher. “Data Mining as a tool for environmental scientists.” (2006).

Steinhaeuser, Karsten, Nitesh V. Chawla, and Auroop R. Ganguly. 2010. An exploration of climate data using complex networks. ACM SIGKDD Explorations Newsletter 12, no. 1: 25.

University of Minnesota. “Expeditions in Computing: Understanding Climate Change-A Data-Driven Approach.” Available online at http://climatechange.cs.umn.edu/. Last accessed September 1, 2016.

Vora U., A. Vakharwala, P. Chomal and M. Sutar, “Mining environmental data for prediction of transmission patterns of communicable diseases,” 2015 17th International Conference on E-health Networking, Application & Services (HealthCom), Boston, MA, 2015, pp. 582–585.
