Data mining refers to the process of discovering patterns and relationships in large data sets (Big Data). Analysing Big Data with new technologies and data models can uncover correlations that can be turned into useful insights. Data mining draws on many different fields, including statistics, data science, artificial intelligence, machine learning and database theory. (1) Data mining rarely exists on its own; it belongs to the larger process of Knowledge Discovery in Databases (KDD). KDD describes the overall process of discovering useful knowledge from data, including steps such as data selection, preprocessing, transformation, data mining and interpretation/evaluation. Data mining is merely one step in the KDD process: the application of specific algorithms for extracting patterns and modelling data.
Figure: Contextualizing data mining as part of the KDD process.
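The KDD steps named above can be sketched as a small pipeline. The example below is only an illustrative outline, not a prescribed method: the file name "customers.csv", the column names and the choice of k-means as the mining algorithm are all hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

# 1. Selection: choose the relevant slice of the raw data.
# ("customers.csv" and its columns are hypothetical placeholders.)
raw = pd.read_csv("customers.csv")
selected = raw[["age", "purchases"]]

# 2. Preprocessing: handle missing or invalid entries.
clean = selected.dropna()

# 3. Transformation: rescale features so they are comparable.
transformed = (clean - clean.mean()) / clean.std()

# 4. Data mining: apply a specific algorithm (here, k-means clustering).
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(transformed)

# 5. Interpretation/evaluation: the analyst inspects what the clusters mean.
print(clean.assign(cluster=clusters).groupby("cluster").mean())
```

Only step 4 is "data mining" in the narrow sense; the surrounding steps are what make its output interpretable.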

The idea of data mining and KDD may seem new given the popular media attention paid to topics like Big Data and Artificial Intelligence; however, these practices have existed for much longer than you may think. Early forms of data mining trace back to the development of statistical methods such as Bayes' Theorem (1700s) and regression analysis (1800s), which were mostly used to identify patterns in data. The term data mining actually carried a negative connotation when it was first introduced to the field in the 1960s. In From Data Mining to Knowledge Discovery in Databases, Fayyad asserts:

“The term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but, in fact, are not. Clearly, this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct relevance to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided.” (Fayyad, 1996)

Here, it is highlighted that many researchers had drawn valid criticism of data mining given its atypical approach to research and knowledge discovery. Because data mining is data driven rather than hypothesis driven, as traditional statistical research is, it is vulnerable to drawing inaccurate or insignificant conclusions. Hence Fayyad's warning to do it "correctly" and not to disregard the statistical aspects of the problem.

The criticism of data mining techniques from statisticians raises the question of how accurate these new technologies really are and which uses of them make the most sense. Correlation does not imply causation, and in the online world this caution is especially important. Big Data is largely composed of "found data": observations collected from online profiles, website traffic or any other large-scale source of human activity and information. (McFarland, 2015) "Found data" is very different from the data used in statistically designed experiments. Statistical experiments place an emphasis on data collection principles and on sampling a more diverse pool of information than typical data mining practices might call for. Data collected from web pages often contains many kinds of bias that can skew results. A larger data set can also make any finding appear more statistically significant, because the standard error of an estimate shrinks as the number of entries grows; the more entries, the smaller the standard error, and the more "significant" even a trivial pattern may seem. (McFarland, 2015) These misunderstandings may also lie with the researcher, who may wrongly assume that a large data set is a representative sample of the entire population when this is not the case. Data collected from online and social websites can overrepresent some user groups and may even include false profiles that are not people at all.
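To illustrate why sheer volume can make weak patterns look "significant", the short sketch below (a hypothetical example, not drawn from McFarland's analysis) generates two variables that share only a negligible relationship and compares the correlation's p-value at a small and a very large sample size.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

def weakly_related_sample(n):
    """Two variables sharing only a tiny fraction of their variation."""
    x = rng.normal(size=n)
    y = 0.02 * x + rng.normal(size=n)  # the effect is negligible in practice
    return x, y

for n in (500, 5_000_000):
    x, y = weakly_related_sample(n)
    r, p = pearsonr(x, y)
    print(f"n = {n:>9,}  correlation = {r:+.4f}  p-value = {p:.3g}")

# With n = 500 the tiny correlation is statistically indistinguishable from
# noise; with millions of entries the same negligible effect produces a
# minuscule p-value, i.e. it looks "significant" without being meaningful.
```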

Algorithms that rely on Big Data can become wildly inaccurate when the data they receive changes. Models built on human patterns and interactions are subject to large swings because human behavior online is unpredictable. In his commentary "Big Data and the danger of being precisely inaccurate", McFarland describes a data model developed to predict flu cases in the United States, which used 50 million Google search terms to estimate weekly cases. At its creation in 2009, this model was seen as an excellent use of Big Data and even began reporting flu cases more precisely than the Centers for Disease Control. But because the algorithm was calibrated to specific search terms that were widely used in 2009, it became inaccurate only two years later, in 2011, when it was found to be over-reporting by almost double the actual number of cases. Small changes in the search terms used by the wider Google population were enough to swing the model and make it grossly inaccurate at predicting flu cases. (McFarland, 2015) Analyses that rely on Big Data are therefore hostage to the predictability of human behavior, which is difficult when people's relationship with technology is constantly changing.
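As a rough illustration of that fragility, the toy sketch below (not the actual Google Flu Trends model, and with entirely made-up numbers) calibrates a predictor while search volume tracks real illness, then applies it after public search habits shift.

```python
import numpy as np

rng = np.random.default_rng(0)
weeks = 100

# Hypothetical "true" weekly case counts, oscillating around 1,000.
true_cases = 1000 + 300 * np.sin(np.linspace(0, 8 * np.pi, weeks))

# Calibration period: search volume closely tracks actual illness.
searches_2009 = 0.05 * true_cases + rng.normal(0, 1.0, weeks)

# Fit a simple linear model: predicted cases = a * searches + b.
a, b = np.polyfit(searches_2009, true_cases, 1)

# Later, search habits change: twice as much flu-related searching occurs
# for the same amount of illness (e.g. driven by media coverage).
searches_2011 = 2.0 * (0.05 * true_cases) + rng.normal(0, 1.0, weeks)
predicted_2011 = a * searches_2011 + b

print("mean actual cases:   ", round(true_cases.mean()))
print("mean predicted cases:", round(predicted_2011.mean()))
# The fitted coefficients never changed, yet the model now over-reports by
# roughly double -- only the public's search behavior shifted.
```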

Though these technologies can point out patterns and correlations in a given set of data, data mining alone cannot learn from or apply that knowledge. That work is done by the researcher analysing these large data sets, or by machine learning algorithms that employ principles of data mining in order to learn how to replicate the data they have been given. In 1959, Arthur Samuel coined the term "machine learning" to describe "a computer's ability to learn without being explicitly programmed". Today, machine learning refers to a set of programmed algorithms that analyse input data in order to predict an output value within an acceptable range. As more data is fed into machine learning algorithms, they gain "intelligence" and become more "accurate" over time. Machine learning algorithms are used to automate many processes in which a decision is generated. Autonomous systems such as these can become dangerous given how easy it is to skew the underlying algorithms by feeding them biased or misrepresentative data.
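As a small illustration of that last point, the sketch below (a hypothetical example using scikit-learn, not tied to any system discussed here) trains a classifier on data that heavily over-represents one group and then checks how well it serves the group it rarely saw.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_group(n, offset):
    """Synthetic two-feature group whose label rule depends on the group."""
    X = rng.normal(offset, 1.0, size=(n, 2))
    y = (X[:, 0] > offset).astype(int)
    return X, y

# Skewed training sample: group A is abundant, group B is barely present.
Xa, ya = make_group(1000, offset=0.0)
Xb, yb = make_group(20, offset=3.0)
model = LogisticRegression().fit(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

# Evaluate on fresh, balanced data from each group.
Xa_test, ya_test = make_group(500, offset=0.0)
Xb_test, yb_test = make_group(500, offset=3.0)
print("accuracy on group A:", model.score(Xa_test, ya_test))
print("accuracy on group B:", model.score(Xb_test, yb_test))
# The classifier learns the rule of the group it mostly saw and performs
# noticeably worse on the under-represented group.
```

In this toy setup the two groups follow different decision rules, so a model trained mostly on one group simply reproduces that group's rule, which mirrors the concern about skewed or unrepresentative input data.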

The pitfalls of data mining do not only appear when analysing entries from Big Data sources. Even data generated under the strict rules of statistically designed experiments carries its own pre-existing biases. When data is collected from humans, it holds the same biases as the people involved in the sampling; the data alone will often reflect the values and intrinsic biases of the sampled population. Using imperfect data to automate processes will simply replicate what has been done before.
