Archive for November, 2017

Thoughts on assuring the quality of VGI (Goodchild and Li, 2012)

Thursday, November 9th, 2017

I think that the most important thing to note from Goodchild and Li’s article on assuring the quality of VGI is that their proposed approaches are only applicable to VGI that is “objective and replicable.” That is to say, they are discussing VGI which attempts to capture the truth of a particular geographic phenomenon (such as contributions to OpenStreetMap), rather than VGI which references an individual’s particular experience in geographic space (such as a volunteered review of a tourist location). I don’t intend for this post to devolve into a discussion on the nature of scientific “truth” and “fact”, but it is definitely interesting to think about the extent to which any type of VGI (and any type of geographic fact, I suppose) can truly be objective. All volunteered information is subject to the bias of its contributor.

I would have liked for this article to also address the challenges in defining “accuracy” for VGI that is purely subjective, rather than fact-based. When we are talking about things like a restaurant review on Yelp or a woman reporting the location of an incidence of sexual assault, what does “accuracy” mean? A restaurant review might be inaccurate in the sense that it could be fabricated by a reviewer who never actually went there, but this is nearly impossible to identify. Perhaps it is the intent of the contributor that matters most in examples like this (i.e. does the reviewer have malicious intent against the particular restaurant?), but underlying intent is still incredibly opaque. Perhaps this is a topic for further class discussion…

Ester et al 1997 – Spatial data mining

Sunday, November 5th, 2017

The broad goal of knowledge discovery in databases (KDD) is, fittingly, to construct knowledge from large spatial database systems (SDBS). This goal is achieved via spatial data mining methods (algorithms) which are used to automate KDD tasks (e.g. detection of classes, dependencies, anomalies). Without a fuller understanding of the field at present, it is hard to judge how comprehensive an approach is outlined in Ester et al.’s (1997) paper.

The authors underline the distinguishing characteristics of spatial databases; namely, the assumption that an object’s attributes may be influenced by the attributes of its neighbours (Tobler). These assumptions motivate the development of techniques and algorithms which automate the identification and extraction of spatial relationships. For instance, a simple classification task can be executed by algorithms that group objects based on the value of their attributes. The authors present a spatial extension of this approach, by incorporating not only an object’s attributes, but also those of its neighbours, allowing for greater insight into spatially classified sets of objects within a SDBS.
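The neighbour-extended classification idea can be sketched in a few lines of Python. This is only a toy illustration of the general approach, not the authors’ algorithm: the data, the distance-based neighbourhood relation, and the feature construction are all invented for the example.

```python
import math

# Hypothetical toy data: (x, y, attribute) for a handful of objects.
objects = [
    (0.0, 0.0, 1.0),
    (1.0, 0.0, 1.2),
    (0.0, 1.0, 0.9),
    (5.0, 5.0, 8.0),
    (6.0, 5.0, 7.5),
]

def neighbours(i, radius=2.0):
    """Indices of objects within `radius` of object i -- a simple
    distance-based stand-in for the paper's neighbourhood relations."""
    xi, yi, _ = objects[i]
    return [j for j, (xj, yj, _) in enumerate(objects)
            if j != i and math.hypot(xi - xj, yi - yj) <= radius]

def extended_features(i):
    """The object's own attribute plus the mean attribute of its
    neighbours: the 'spatial extension' of a plain classifier's input."""
    own = objects[i][2]
    nb = neighbours(i)
    nb_mean = sum(objects[j][2] for j in nb) / len(nb) if nb else own
    return (own, nb_mean)

# A non-spatial classifier would see only the first feature; the
# spatial version can also split on the neighbourhood mean.
features = [extended_features(i) for i in range(len(objects))]
print(features)
```

A classifier fed the second feature can separate objects sitting in high-valued neighbourhoods from those in low-valued ones, even when their own attributes are similar.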

Contrasting with last week’s topic, the approach to knowledge extraction here emphasises automation. The goal is to construct basic rules that can efficiently manipulate and evaluate large datasets to detect meaningful, previously unknown information. Certainly, these techniques have been invaluable for pre-processing, transforming, mining and analysing large databases. In light of recent advances, it would be interesting to revisit these techniques to assess whether new spatial data mining methods are more effective for guessing or learning patterns that may be interpreted as meaningful, and to consider the theoretical limits of these approaches (if they exist).
-slumley

Spatial Data Mining: A Database Approach, Ester et al. (1997)

Sunday, November 5th, 2017

Ester et al. (1997) propose basic operations used for knowledge discovery in databases (KDD) for spatial database systems. They do so with an emphasis on the utility of neighbourhood graphs and neighbourhood indices for KDD. When the programming language began to bleed into the article, it was clear that some of the finer points would be lost on me. I was reminded of the discussion of whether or not it’s critical that every concept in GIScience be accessible to every GIS user. I’m convinced that in order for GIS users to practice critical reflexivity in their use of queries within a database, they ultimately need to understand the fundamentals of the operations they utilize. After making it through the article, I can say that Ester et al. explain these principles to a broader audience reasonably well. I’ll have to echo the sentiments of previous posts that it would have been interesting to see more discussion of this, but perhaps it’s beyond the scope of this article.

Maybe it’s because we’re now into our 9th week of GIScience discourse, but I felt that the authors did a particularly good job of situating spatial data mining–which, despite its name, might appear more closely related to the field of computer science at a glance–within the realm of GIScience. Tobler’s Law even makes an appearance on page 4! It’s an interesting thought that GIScientists might have more to contribute to computation beyond the handling of explicitly spatial data. For instance, Ester et al. point to spatial concept hierarchies that can be applied to both spatial and non-spatial attributes. You can imagine how spatial association rules conceived by spatial scientists might then lend themselves to the handling of non-spatial data as well.

On Ester et al (1997)’s Spatial Data Mining in Databases

Sunday, November 5th, 2017

In their article “Spatial Data Mining: A Database Approach” (1997), Ester et al. outline the possibility of knowledge discovery in databases (KDD) using spatial databases, utilizing four classes of algorithms (spatial association, clustering, trends, and classification). Unfortunately, the algorithms are not entirely connected to how one mines spatial information from databases, and the algorithms introduced don’t seem incredibly groundbreaking 20 years later. This paper seemed very dated, particularly because most of these algorithms are now tools in ESRI’s ArcGIS and part of the frameworks behind GeoDa, and because the processing issues that seemed to plague the researchers in the late 1990s are not issues (on the same scale) today.

Also, I found it strange that the paper adopted an incredibly positivist approach and did not mention anything about how these tools could be applied in real life. The authors acknowledged this as a point of further research in the conclusion, but weighted it less heavily than the importance of speeding up processing times in ’90s computing. In their introduction, they discuss their rationale for using nodes, edges, and quantifying relationships using Central Place Theory (CPT). However, they do not mention that CPT, and theorizing the world as nodes and edges, is an incredibly detached idea that 1) cannot describe all places, 2) does not recognize that human behaviour is inherently messy and not easily predictable by mathematical models, and 3) only identifies trends and cannot be used to actually explain things, just to identify deviances from the mathematical model. Not everything can be captured by a relationship that a researcher specifies in order to scrape data using an inherently flawed model, and therefore there will be inaccuracies. It will be interesting to learn if/how spatial data miners have adapted to this and (hopefully) humanized these processes since 1997.

Database approach to spatial data mining (Ester et al.)

Sunday, November 5th, 2017

Spatial data mining consists of the use of database information and manipulation through algorithms to process spatial information as effectively as possible. It is able to use available information to infer other pieces of information through dependency between variables. Thus, it can relate to aspects of spatial privacy: personal information voluntarily provided can be used to determine additional information about people (or, in the case of this paper, areas) that might otherwise not be divulged.

To be upfront, spatial data mining is a topic that I was rather intimidated to look into, since I have only a basic understanding of computer science and was confused by the majority of the more technical information presented in the paper. However, I thought the paper did a good job of conveying how the concepts are used and why they are applied; I understood the logic behind the algorithms and how information is mined. Effectively, I believe that the paper caters to a wide audience thanks to its combination of technical and conceptual information.

The article explicitly covers the basics of spatial data mining: the fundamental operations and concepts used in the area of study. This raises the question, “what are the complex and advanced methods of spatial data mining?” Since this paper was written in 1997, the field has probably made considerable advances, and new methods may well be on the horizon. For the purposes of this article, however, the basics were very well introduced, allowing a range of readers to learn about the field of spatial data mining through knowledge discovery in databases.

Shekhar et al – Spatial Data Mining

Sunday, November 5th, 2017

This paper presented the primary tools with which to perform data mining on a set of data. The tangible results of data mining were not new to me; I believe they are something that many budding GIScientists engage with at the beginning of their education. I remember working with learning and training data in other classes, typically in the form of geolocating.
I found that the hidden data sets emerging from these analyses offer a very interesting insight into our epistemology of data sets. With learning and training data, it seems that we’re engaging with a very basic form of machine learning. I am intrigued by the opportunities this presents with a more open form of data. I can imagine that with more open data sources, the machine learning aspects could learn from other data sets and gain more insight into hidden data. I wonder if our treatment of data and rights will come into discussion in the future. I’d be interested in knowing in what forums these conversations are taking place.
As a whole, these techniques seem to provide very valuable tools. Extrapolating meaning from disparate forms of data, such as by clustering, determining outliers, and figuring out co-location rules, can be extremely insightful for a lot of disciplines in the social and physical sciences. Taking a rudimentary psychological lens, I find it interesting how many of these techniques assume a behaviouralist understanding of spatial processes, in which processes interact in rational ways with each other as part of a greater whole. The fact that the authors take interest in outliers seems to factor in the irrationality of some processes. I would also be interested in knowing where the research on that is headed.

Spatial Data Mining (Shekhar et al)

Saturday, November 4th, 2017

I found this paper particularly tough to get into, as spatial data mining veers more towards a tool used in G.I.S. than any of the topics we have covered thus far, in my opinion. Although the tweaking of methods like SAR and MRF models to meet the issues regular data mining ran into (i.e. ignoring spatial auto-correlation and spatial heterogeneity) is a sign of tool building, I still find this topic in GIScience to be very technical and definitely in the tool realm of G.I.S. Furthermore, many of the clustering techniques mentioned (i.e. K-means) have been around for years now and have been accepted as the standard in most regular G.I.S. projects, making me ask, “what makes spatial data mining so special?” Is it simply the size of the data being mined, and the unsupervised aspect of it? As this paper cites papers from 1999 and 2000 on spatial data mining’s ability to work with large amounts of data back then, I wonder how well spatial data mining works with big data, and how the validation process and statistical analysis would work today.

Although this paper focuses on the uses of spatial data mining with raster datasets, it struck me that if this technique were used on vector data possibly including personal information (i.e. age or phone number) tied to space in order to look for ‘hidden patterns’, this would definitely be a violation of privacy.

All in all, although this field seems quite complex, it also seems very simple in that it embodies all of the basic algorithms used in traditional GIS projects, though on a larger scale.

-MercatorGator

 

Thoughts on Shekhar et al. (2003)

Saturday, November 4th, 2017

Shekhar et al. (2003) outline various techniques in spatial data mining which can be used to extract patterns from spatial datasets. In discussing techniques for modeling spatial dependency, detecting spatial outliers, identifying spatial colocation, and determining spatial clustering, Shekhar et al. effectively demonstrate the relevant challenges and considerations when working with a spatial dataset. Due to factors such as spatial dependency and spatial heterogeneity, “general purpose” data mining techniques will perform poorly on spatial datasets and new algorithms must be considered (Shekhar et al., 2003).

Shekhar et al. define a spatial outlier as a “spatially referenced object whose non-spatial attribute values differ significantly from those of other spatially referenced objects in its spatial neighbourhood” (p 8). I have not previously been exposed to research on spatial outliers, but I was surprised to read such a definition in which an outlier is determined by its non-spatial attribute. I am left wondering if it is possible to invert Shekhar’s definition and define spatial outliers in terms of differences in spatial attribute values among objects with consistent non-spatial attribute values. For example, when talking about the locations of bird nests, could we define a spatial outlier as a nest which is significantly far from a cluster of other nests?
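As a rough illustration of that definition (not Shekhar et al.’s actual test statistic, and with invented data and thresholds), one could flag points whose non-spatial attribute deviates strongly from the mean attribute of their spatial neighbourhood:

```python
import math

# Hypothetical points: (x, y, non-spatial attribute).
points = [
    (0, 0, 10.0), (1, 0, 11.0), (0, 1, 9.5),
    (1, 1, 40.0),          # attribute wildly different from its neighbours
    (10, 10, 10.5), (11, 10, 10.0),
]

def spatial_outliers(pts, radius=2.0, threshold=15.0):
    """Flag points whose attribute deviates from the mean attribute of
    their spatial neighbourhood by more than `threshold` -- a crude
    version of the neighbourhood-comparison idea in the definition."""
    flagged = []
    for i, (xi, yi, vi) in enumerate(pts):
        nb = [v for j, (xj, yj, v) in enumerate(pts)
              if j != i and math.hypot(xi - xj, yi - yj) <= radius]
        if nb and abs(vi - sum(nb) / len(nb)) > threshold:
            flagged.append(i)
    return flagged

print(spatial_outliers(points))  # → [3]
```

Note that point 3 is a spatial outlier without being a global one in any spatial sense: it sits right inside a cluster, which is exactly why the comparison has to be against the neighbourhood rather than the whole dataset.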

As this article was broadly speaking about knowledge discovery from spatial datasets, I was reminded of last week’s lecture on geovisualization. While the objective approach of spatial data mining contrasts the exploratory geovisualization process, I am curious how the two approaches can effectively be combined to drive a more holistic process of knowledge discovery from spatial data.

Spatial Data Mining – Ester, Kriegel, Sander (1997)

Friday, November 3rd, 2017

Tobler’s Law of Geography is central to spatial data mining. The purpose of knowledge discovery in databases is to identify clusters of similar attributes and find links with the distribution of other attributes in the same areas. Using decision-tree algorithms, spatial database systems and their associated neighborhood graphs can be classified, and rules can be derived from the results. The four generic tasks introduced in the beginning of the article are not addressed later on. Identifying deviation from an expected pattern is presented as central to KDD as well, but an algorithm for this doesn’t appear to be discussed.

The article remains strictly concentrated on the implications of KDD algorithms on spatial database systems and computer systems. Little relation is made to non-spatial database systems, even though many of the algorithms presented are based on non-spatial decision-tree algorithms.

I’m sure that patterns can be detected in human attributes of nodes in a social network. Since distance along an edge is so crucial to spatial classification, do non-physical edges quantified in other ways perform similarly in the creation of human “neighborhoods”? When patterns are deviated from, can conclusions be drawn as easily about social networks?

“Neighborhood indices” are important sources of knowledge that can drastically reduce the time of a database query. Creating spatial indices requires some knowledge of a spatial hierarchy. Spatial hierarchies are clear-cut in political representations of geography. As pointed out in the article, the influence of centers (i.e. cities) is often not restricted to political demarcations. These algorithmically created neighborhood indices may present interesting results to urban planners and geographers, who often have difficulty delineating the extent of influence of cities beyond their municipal borders.
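The query-speedup idea can be conveyed with a toy grid index in Python: bin points by cell so a radius query only inspects nearby cells rather than scanning the whole table. This is a generic bucketing sketch with made-up data, not the specific index structure from the article.

```python
import math
from collections import defaultdict

CELL = 5.0  # side length of a grid cell (arbitrary for the example)

def build_index(points):
    """Bucket each (x, y) point by the grid cell it falls in."""
    index = defaultdict(list)
    for (x, y) in points:
        index[(int(x // CELL), int(y // CELL))].append((x, y))
    return index

def radius_query(index, x, y, r):
    """Points within r of (x, y), touching only candidate cells."""
    hits = []
    cx0, cx1 = int((x - r) // CELL), int((x + r) // CELL)
    cy0, cy1 = int((y - r) // CELL), int((y + r) // CELL)
    for cx in range(cx0, cx1 + 1):
        for cy in range(cy0, cy1 + 1):
            for (px, py) in index.get((cx, cy), []):
                if math.hypot(px - x, py - y) <= r:
                    hits.append((px, py))
    return hits

pts = [(1, 1), (2, 2), (40, 40), (41, 40)]
idx = build_index(pts)
print(radius_query(idx, 0, 0, 3))  # → [(1, 1), (2, 2)]
```

The distant points at (40, 40) and (41, 40) are never even distance-tested, which is where the savings come from as tables grow.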

 

Spatial Data Mining: Shekhar, Zhang, Huang and Vatsavai (2003)

Friday, November 3rd, 2017

The article by Shekhar, Zhang, Huang and Vatsavai (2003) begins with a clear explanation of the differences between spatial and non-spatial data mining, with some interesting examples. It would have been useful to include some of the information from last week’s geoviz article about the prevalence of spatial information in digital data (~80%) for context, especially given the link between geoviz and data mining made at the end of the article. The article then goes on to list different statistical phenomena and methods, with clear examples, which was helpful for context and for keeping the text engaging.

The section I found most interesting, and which I think Allen will focus on during his research, is clustering. One thing that was not mentioned in the article, and which I wonder about, is the role of scale in spatial clustering, especially with large data sets. If you’re looking for spatial clusters, won’t scale play a big role in determining the clusters? I.e., something might seem like a small cluster, but at a smaller scale, it is part of an even larger cluster. Using Allen’s research project of taxi ridership in NYC as an example, I would imagine that certain areas of Manhattan will have high instances of taxi ridership, but at a smaller scale, Manhattan as a whole would be an area of taxi ridership clustering. I wonder how the choices of scale and data granularity in analysis lead to different results, and whether it is useful to run the analysis at different spatial scales.
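The scale effect can be made concrete with a toy single-linkage clustering (a simplified stand-in for the density-based methods the article covers, with invented points): the same point set forms two clusters at a fine distance threshold and merges into one at a coarser threshold.

```python
import math

def clusters(points, eps):
    """Count clusters under single-linkage grouping: points within eps
    of each other share a cluster. `eps` plays the role of 'scale'."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points[:i]):
            if math.hypot(xi - xj, yi - yj) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# Two tight pairs that merge into one blob at a coarser threshold.
pts = [(0, 0), (1, 0), (6, 0), (7, 0)]
print(clusters(pts, 2))   # fine scale: 2 clusters
print(clusters(pts, 5))   # coarse scale: 1 cluster
```

Nothing about the data changed between the two calls; only the analysis scale did, which is exactly why running clustering at several thresholds (or map scales) can be worthwhile.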

 

Thoughts on Spatial Data Mining Chapter (Shekhar et al.)

Thursday, November 2nd, 2017

This chapter provided a review of several spatial data mining techniques, example datasets, and how equations can be adapted to deal specifically with spatial information. In the very beginning, the authors state that to address the uniqueness of spatial data, researchers would have to “create new algorithms or adapt existing ones.” Immediately, I thought about how these algorithms would be adapted; would the inputs be standardized to meet the pre-conditions of non-spatial statistics? Or would the equations themselves be adapted by adding new variables to account for differences in spatial data? The authors address these questions later in their explication of the different parts of the Logistic Spatial Autoregressive Model (SAR).
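The SAR family follows the second route: the model equation itself gains a spatial term, with the classical regression y = Xβ + ε becoming y = ρWy + Xβ + ε, where W is a (typically row-normalised) matrix encoding which observations neighbour which. A minimal Python sketch of the extra spatially-lagged regressor Wy, using an invented three-observation example rather than anything from the chapter:

```python
# Hypothetical contiguity structure: observation 0 neighbours 1,
# observation 1 neighbours 0 and 2, and observation 2 neighbours 1.
# Rows are normalised so each observation's weights sum to 1.
W = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
y = [2.0, 4.0, 6.0]

# W·y gives each observation's neighbourhood-weighted outcome -- the
# extra term that lets the model absorb spatial autocorrelation
# instead of leaving it in the residuals.
Wy = [sum(w * yj for w, yj in zip(row, y)) for row in W]
print(Wy)  # → [4.0, 4.0, 4.0]
```

So the answer to the question above is both: the inputs gain a new derived variable (Wy), and the equation gains a new coefficient (ρ) to estimate alongside β.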

When discussing location prediction, the authors state that “Crime analysis, cellular networks, and natural disasters such as fires, floods, droughts, vegetation diseases, and earthquakes are all examples of problems which require location prediction” (Shekhar et al. 5/23). Given the heterogeneity and diversity of these various data inputs, I was wondering how any level of standardization is achieved in SDM, and how interoperability is achieved when performing the same operations on such different data types.

What I gathered from this chapter was that there is considerable nuance and specificity within each SDM technique. Given the diversity of applications for each technique, from species growth analysis to land use change to urban transportation data, the choice of attribute included in the model greatly influences the subsequent precision of any observed correlation (see the example of vegetation durability versus vegetation species for location prediction).

There was a clear link between SDM and data visualization, as illustrated by the following statement about visualizing outliers: “there is a research need for effective presentations to facilitate the visualization of spatial relationships while highlighting spatial outliers.” Clearly, there is overlap between accurate spatial models and the effective presentation of that data for the intended audience.

-FutureSpock