Theory of Data Mining.
In a previous blog entry (2.4. DataMining or data mining) I made an initial approach to the theory of Data Mining. Data mining processes try to extract information hidden in the data using different techniques, mostly statistical and mathematical models in combination with application software.
Given the complexity of these techniques, this blog is not going to go into depth on the subject (for reasons of time and knowledge). We will just look at a couple of data mining methodologies, list the most common techniques, and review the concepts behind three of them through practical examples. These same examples will later let us use the data mining tools provided by MicroStrategy 9 (also included in the MicroStrategy Reporting Suite) and explain the product's view of data mining techniques.
Before starting, I recommend watching the presentation Data Mining: Knowledge Extraction in Large Databases, by José M. Gutiérrez, Dept. of Applied Mathematics at the University of Cantabria, Santander.
For those who want or need to go deeper into the theory of data mining, its techniques and possibilities, here is a list of references to some of the most important books in this field:
- Data mining: Practical machine learning tools and techniques.
- Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, 2nd Edition.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
- Advanced Data Mining Techniques.
- Data Mining: Concepts and Techniques .
- Data Preparation for Data Mining .
Steps in a Data Mining project
There are several standard methodologies for developing a data mining analysis systematically. One of the best known is CRISP-DM, an industry standard consisting of a sequence of steps commonly used in a data mining study. The other is SEMMA, specific to SAS, which lists the steps in more detail. Let us look at each one.
CRISP-DM (Cross-Industry Standard Process for Data Mining).
The model consists of 6 interrelated phases arranged cyclically (with feedback). You can find more information on the methodology in the manuals section of Dataprix.com. You can also access the CRISP project website here. The phases are:
- Business Understanding: understanding the business, including its objectives, evaluating the current situation, setting the targets to be met by the data mining study, and developing a project plan. At this stage we define what the object of study is and why it is raised. For example, a travel web portal wants to analyze its customers and their buying habits in order to segment them and run specific marketing campaigns for each target group, with the aim of increasing sales. That will be the starting point of a data mining project. Detailed information on this phase at Dataprix.com.
- Data Understanding: once the project's objectives are established, we need to understand the data and determine the information requirements necessary to carry out the project. This phase may include data collection, description, exploration, and verification of data quality. At this stage we can use techniques such as summary statistics (with variable visualization) or perform cluster analysis with the aim of identifying patterns within the data. It is important at this stage to clearly define what we want to analyze, in order to identify the information needed to describe the process and to analyze it. Then we need to see which information is relevant to the analysis (some aspects may be discarded), and finally verify that the identified variables are independent of each other. For example, suppose we are in a data mining project for customer segmentation. Of all the information available in our systems or from external sources, we must identify what is related to the problem (customer data: age, children, income, area of residence), then which of it is relevant (we are not interested, for example, in customers' tastes), and finally, for the selected variables, check whether they are interrelated (income level and area of residence are not independent variables, for example). The information is usually classified as demographic (income, education, number of children, age), sociographic (hobbies, membership of clubs or institutions), or transactional (sales, credit card expenses, checks issued, etc.). In addition, data can be quantitative (measured using numerical values) or qualitative (information determining categories, using nominal or ordinal values). Quantitative data can typically be represented by some kind of probability distribution (which tells us how the data are scattered and clustered). Qualitative data must first be encoded as numbers, which will then be described by frequency distributions.
Detailed information on the phase Dataprix.com .
- Data Preparation: once the data sources are identified, they must be selected, cleaned, transformed into the desired shape, and formatted. In this phase we undertake the Data Cleaning and Data Transformation processes needed for subsequent modeling. We can also perform deeper data exploration to find similar patterns within the data. If we are using a Data Warehouse as a data source, these tasks have already been performed when loading the data. We may also need to aggregate information (for example, building sales by period), which we can extract from our DW with the typical tools of a BI system. Other transformations convert a value into a range (income from/to determines income category n) or perform operations on the data (to determine a customer's age we use the current date and the date of birth, etc.). In addition, each data mining software tool may have specific requirements that force us to prepare the information in a given format (Clementine and PolyAnalyst, for example, have different data types). Detailed information on this phase at Dataprix.com.
- Modeling: in the modeling phase we use specific data mining software and visualization tools (formatting the data to establish relationships between them) or cluster analysis (to identify which variables combine well). These tools can be useful for an initial analysis, which may be supplemented with rule induction to develop initial association rules and refine them. Once we have examined our knowledge of the data (often through patterns recognized by viewing a model's output), other analysis models may become appropriate (such as decision trees). At this stage we divide the datasets into learning and test sets. The tools allow us to generate results for various situations, and the interactive use of multiple models allows us to go deeper into the data discovery. Detailed information on this phase at Dataprix.com.
- Evaluation: the resulting model must be evaluated in the context of the business objectives set out in the first phase. This can lead to the identification of other needs and a return to previous stages for further work (if we find, for example, a variable that affects the analysis but was not taken into account when defining the data). This is an iterative process in which we gain understanding of the business processes through visualization techniques, statistical techniques, and artificial intelligence, which show the user new relationships between the data and allow a better understanding of the organization's processes. It is the most critical phase, as this is where we interpret the results. Detailed information on this phase at Dataprix.com.
- Deployment: data mining can be used either to verify previously defined hypotheses (we think that a 5% discount will increase sales, but we have not checked it with a model before implementing the measure) or to discover knowledge (identifying useful and unexpected relationships). This discovered knowledge can then be applied to different business processes, implementing organizational changes where necessary. Consider the typical example of a mobile phone company that detects churn among long-term customers caused by poor customer service: once that aspect is detected, organizational changes are made to improve it. The applied changes can be monitored to verify after a given time whether the correction worked or whether they have to be adjusted to include new variables. It is also important to document everything, to serve as a basis for future studies. Detailed information on this phase at Dataprix.com.
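The age-derivation and income-binning transformations mentioned in the Data Preparation phase can be sketched in a few lines of Python. The field names, band thresholds, and the sample record are all made up for illustration:

```python
from datetime import date

# Hypothetical customer record; field names are illustrative, not from any real schema.
customer = {"name": "A. Smith", "birth_date": date(1975, 6, 1), "income": 32000}

def age_from_birth_date(birth_date, today=None):
    """Derive age in whole years from a birth date (a typical derived variable)."""
    today = today or date.today()
    years = today.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def income_category(income):
    """Map a continuous income value onto an ordinal category (binning)."""
    if income < 20000:
        return "low"
    if income < 50000:
        return "medium"
    return "high"

customer["age"] = age_from_birth_date(customer["birth_date"], today=date(2010, 3, 1))
customer["income_band"] = income_category(customer["income"])
print(customer["age"], customer["income_band"])  # 34 medium
```

In a real project these transformations would run inside the ETL or the mining tool itself; the point is only that raw fields become model-ready variables.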
The six-step process is not a rigid model; there is usually a lot of feedback to and from previous phases. In addition, analysts need not go through every phase in every study.
SEMMA (Sample, Explore, Modify, Model and Assess).
For it to be properly applied, a data mining solution should be viewed as a process rather than as a set of tools and techniques. This is the aim of the methodology developed by the SAS Institute, called SEMMA: Sample, Explore, Modify, Model, and Assess. The method aims to make it easy to carry out exploration and statistical visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and finally confirm the reliability of the model. As in the CRISP model, feedback and returning to previous stages are possible. The graphical representation is:
The phases are as follows:
- Sample: from a large volume of information, extract a sample significant enough in size yet small enough for agile handling. This reduction in data size allows us to perform the analysis more rapidly and obtain crucial information from the data more immediately. The data samples can be classified into three groups according to their purpose: Training (used to build the model), Validation (used for model evaluation), and Test (used to confirm and generalize the model's results).
- Explore: in this phase the user searches for unexpected trends or anomalies to gain a better understanding of the data set. The data are explored both visually and numerically for trends or groupings. This exploration helps to refine and redirect the process. If visual analysis gives no results, the data can be explored using statistical techniques such as factor analysis, correspondence analysis, and clustering.
- Modify: this is where the user creates, selects, and transforms the variables that will go into building the model. Based on the findings of the exploration phase, the data are modified to include group information or to introduce new variables that may be relevant, or to remove those that really are not.
- Model: once we find a combination of variables that reliably predicts the desired outcome, we are ready to build a model that explains the patterns in the data. Modeling techniques include neural networks, decision trees, logistic models, statistical models such as time series, memory-based reasoning, etc.
- Assess: in this phase the user evaluates the usefulness and reliability of the discoveries made in the data mining process. Here we verify how well a model works: we apply it to different data samples (test) or other known data, and thus confirm its validity.
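The three-way Training/Validation/Test split from the Sample phase can be sketched like this. The proportions and the toy dataset are illustrative choices, not part of any official SEMMA specification:

```python
import random

def split_sample(records, train=0.6, validation=0.2, seed=42):
    """Shuffle a dataset and split it into Training, Validation and Test samples,
    as suggested in the Sample phase. Proportions are illustrative."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * validation)
    return (shuffled[:n_train],                 # used to build the model
            shuffled[n_train:n_train + n_val],  # used for model evaluation
            shuffled[n_train + n_val:])         # used to confirm the results

data = list(range(100))
train_set, val_set, test_set = split_sample(data)
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```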
Data Mining techniques
Statistical analysis:
It uses the following tools:
1. ANOVA (Analysis of Variance): checks whether there are significant differences between the means of one or more continuous variables across different population groups.
2. Regression: defines the relationship between one or more variables and a set of predictor variables.
3. Chi-squared test: tests the hypothesis of independence between variables.
4. Principal components: reduces the number of observed variables to a smaller number of artificial variables, retaining most of the information on the variance of the original variables.
5. Cluster analysis: classifies a population into a number of groups, based on profile similarities and dissimilarities between the components of that population.
6. Discriminant analysis: a method of classifying individuals into previously established groups, finding the rule that allows classification of new elements into these groups and thereby identifying the variables that best define group membership.
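As an illustration of the chi-squared independence test from the list above, here is a minimal pure-Python computation of the statistic for a contingency table. The table counts (residence area vs. income band) are invented:

```python
def chi_squared_statistic(table):
    """Chi-squared statistic for a contingency table (a list of rows),
    used to test independence between two categorical variables."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the independence hypothesis.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Illustrative 2x2 table: residence area (rows) vs. income band (columns).
table = [[30, 10],
         [20, 40]]
print(round(chi_squared_statistic(table), 2))  # 16.67
```

A large statistic (compared against the chi-squared distribution with the appropriate degrees of freedom) leads us to reject independence between the two variables.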
Methods based on decision trees:
The CHAID method (Chi-Squared Automatic Interaction Detector) generates a decision tree to predict the behavior of a variable from one or more predictor variables, such that the sets on the same branch and at the same level are disjoint. It is useful in situations where the objective is to divide a population into different segments based on some decision criterion.
The decision tree is constructed by splitting the dataset into two or more subsets of observations according to the values taken by the predictors. Each of these subsets is then partitioned again using the same algorithm. This process continues until no significant differences remain in the influence of the predictor variables on the value of the response variable within these groups.
The root of the tree is the full data set; subsets and sub-subsets make up the tree's branches. A set on which a partition is performed is called a node.
The number of subsets in a partition can range from two to the number of distinct values the variable used for the split can take. The predictor variable used to create a partition is the one most significantly associated with the response variable, according to a chi-squared test of independence on a contingency table.
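The selection step just described, picking the predictor most associated with the response by chi-squared, can be sketched as follows. The records and attribute names are invented, and using the raw chi-squared value instead of a p-value is a simplification of what CHAID actually does:

```python
from collections import Counter

def chi_sq(pairs):
    """Chi-squared association between two categorical sequences given as (x, y) pairs."""
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    cell = Counter(pairs)
    stat = 0.0
    for x in x_counts:
        for y in y_counts:
            expected = x_counts[x] * y_counts[y] / n
            stat += (cell[(x, y)] - expected) ** 2 / expected
    return stat

def best_split_variable(records, predictors, target):
    """CHAID-style selection: the predictor most significantly associated with
    the response (here by raw chi-squared value; real CHAID uses p-values)."""
    return max(predictors,
               key=lambda p: chi_sq([(r[p], r[target]) for r in records]))

# Hypothetical campaign data: does 'age_band' or 'sex' separate buyers better?
records = [
    {"age_band": "young", "sex": "F", "buys": "yes"},
    {"age_band": "young", "sex": "M", "buys": "yes"},
    {"age_band": "old",   "sex": "F", "buys": "no"},
    {"age_band": "old",   "sex": "M", "buys": "no"},
    {"age_band": "young", "sex": "F", "buys": "yes"},
    {"age_band": "old",   "sex": "M", "buys": "no"},
]
print(best_split_variable(records, ["age_band", "sex"], "buys"))  # age_band
```

The tree construction then recurses: split the node on the winning variable and repeat on each child subset until no significant association remains.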
Genetic algorithms:
These are numerical optimization methods in which the variable or variables to be improved, together with the study variables, form a piece of information (an individual). The configurations of the analysis variables that obtain the best values for the response variable correspond to the segments with the greatest reproductive capacity. Through reproduction, the best segments survive and their share grows from generation to generation. Random elements can also be introduced to change the variables (mutations). After a certain number of iterations, the population will consist of good solutions to the optimization problem.
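A minimal sketch of this idea, on the toy problem of maximizing the number of 1-bits in a string ("one-max"); population size, mutation rate, and all other parameters are illustrative:

```python
import random

def genetic_maximize(fitness, n_bits=8, pop_size=20, generations=40, seed=1):
    """Minimal genetic algorithm sketch: selection of the fittest, single-point
    crossover, and random mutation over bit-string individuals."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]            # the best segments remain
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_bits)         # single-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:                 # occasional mutation
                i = rng.randrange(n_bits)
                child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Toy fitness: the number of 1-bits in the individual.
best = genetic_maximize(fitness=sum)
print(best)
```

Generation after generation the surviving individuals carry more 1-bits, which is exactly the "best segments remain and their share grows" behavior described above.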
Neural Networks:
These are generally numerical methods with parallel processing, in which the variables interact through linear or nonlinear transformations to obtain outputs. These outputs are compared with the ones that should have come out, based on test data, resulting in a feedback process by which the network is reconfigured until a suitable model is obtained.
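The feedback idea can be illustrated with a single logistic neuron trained on the OR function. This is a deliberately minimal sketch (one neuron, no hidden layer), not a full network:

```python
import math
import random

def train_neuron(samples, epochs=500, lr=0.5, seed=0):
    """Single logistic neuron trained by feedback: compare the output with the
    expected one and adjust the weights accordingly."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            out = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            err = target - out                 # feedback: expected vs. actual
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Toy data: learn the OR function.
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b = train_neuron(samples)
print([round(predict(w, b, x)) for x, _ in samples])  # [0, 1, 1, 1]
```

Real networks stack many such units in hidden layers, but the reconfigure-on-error loop is the same.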
Fuzzy logic:
This is a generalization of the classical concept of sets in statistics. Classical statistics is based on probability theory, which in turn builds on set theory, in which the membership relation to a set is dichotomous (2 is either even or not). If we introduce the notion of a fuzzy set, in which membership has a certain degree (is a day at 20°C hot?), we obtain a broader kind of statistics whose results are closer to human reasoning.
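A fuzzy membership function for the "hot day" example above might look like this; the 15-30°C thresholds and the linear ramp are invented for illustration:

```python
def hot_membership(temp_c):
    """Degree of membership in the fuzzy set 'hot': 0 below 15C, 1 above 30C,
    and a gradual degree in between (thresholds are illustrative)."""
    if temp_c <= 15:
        return 0.0
    if temp_c >= 30:
        return 1.0
    return (temp_c - 15) / 15

# A 20C day is neither clearly hot nor clearly not-hot:
print(hot_membership(20))
```

Instead of the crisp answer "hot or not", the function returns a degree around 0.33, which is exactly the partial membership that fuzzy logic reasons with.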
Time Series
This is the study of a variable through time in order to make predictions from that knowledge, under the assumption that no structural changes will occur. It is often based on a study of the series in terms of cycles, trends, and seasonality, which differ in the time span they cover, in order to recompose the original series. Hybrid approaches with the previous methods can also be applied, in which the series is explained not only in terms of time but as a combination of other, more stable environment variables that are therefore more easily predictable.
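The trend component can be illustrated with a least-squares linear fit and its extrapolation. The sales series is a toy example, and cycles and seasonality are deliberately ignored here:

```python
def linear_trend(series):
    """Fit y = a + b*t by least squares over t = 0..n-1, the simplest
    trend component of a time series."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    b = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series)) \
        / sum((t - t_mean) ** 2 for t in range(n))
    a = y_mean - b * t_mean
    return a, b

def forecast(series, steps):
    """Extrapolate the fitted trend, assuming no structural change occurs."""
    a, b = linear_trend(series)
    n = len(series)
    return [a + b * (n + k) for k in range(steps)]

sales = [10, 12, 14, 16, 18]   # perfectly linear toy series
print(forecast(sales, 2))      # [20.0, 22.0]
```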
Classification of data mining techniques
Data mining techniques can be classified as Association, Classification, Clustering and Time Series Predictions.
- Association: the relation between an item in a transaction and another item in the same transaction is used to predict patterns. For example, a customer purchasing a computer (X) also buys a mouse (Y) in 60% of cases, and this pattern occurs in 5.6% of computer purchases. The association rule in this situation is "X implies Y", where 60% is the confidence factor and 5.6% the support factor. When the confidence and support factors are represented by the linguistic variables high and low, the association rule can be written in fuzzy-logic form, such as "when the support factor is low, X implies Y is high". A classic real-world example of this kind of data mining is the association found in supermarkets between sales of baby diapers and beer (see the blog entry Bifacil). The algorithms used are association rules and decision trees.
- Classification: here the methods aim to learn different features that classify the data into a predefined set of classes. Given the predefined classes, a number of attributes, and a training data set, classification methods can automatically predict the class of previously unclassified data. The key issues in classification are the evaluation of classification errors and predictive power. The mathematical techniques most used for classification are binary decision trees, neural networks, linear programming, and statistics. Using a binary decision tree, a Yes/No tree induction model, we can assign the data to different classes depending on the values of their attributes; however, this classification may not be optimal if the predictive power is low. Using neural networks we can construct a neural induction model, in which the attributes are input layers and the classes associated with the data are output layers. Between the input and output layers there is a large number of hidden connections that ensure the reliability of the classification (as if they were the connections of a neuron with those around it). The neural induction model gives good results in many data mining analyses, although a large number of attributes and relationships can complicate the implementation of the method. Using linear programming techniques, the classification problem is viewed as a special case of linear programming; this optimizes the classification of the data but can lead to complex models that require long computation times. Other statistical methods such as linear regression, discriminant analysis, or logistic regression are also popular and frequently used in classification.
- Clustering (segmentation): cluster analysis takes ungrouped data and groups it using automated techniques. Clustering is unsupervised and requires no training data set. It shares a set of methodologies with classification; that is, many of the mathematical models used in classification can also be applied to cluster analysis. It uses clustering and sequence clustering algorithms.
- Prediction / Estimation: predictive analysis is related to regression techniques. The main idea of predictive analysis is to discover the relationships between dependent and independent variables, and the relationships among the independent variables. For example, if sales is an independent variable, profit can be a dependent variable.
- Time Series (forecasting): using historical data together with linear or nonlinear regression techniques, we can produce regression curves that are used to make predictions about the future. It uses time series algorithms.
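The support and confidence factors from the association example can be computed directly. The transactions below are invented and chosen so the X-implies-Y pattern is easy to check by hand:

```python
def rule_metrics(transactions, x, y):
    """Support and confidence of the association rule 'X implies Y'
    over a list of transactions (each transaction is a set of items)."""
    with_x = [t for t in transactions if x in t]
    with_xy = [t for t in with_x if y in t]
    support = len(with_xy) / len(transactions)   # how often X and Y co-occur overall
    confidence = len(with_xy) / len(with_x)      # how often Y appears given X
    return support, confidence

# Toy purchases: does 'computer' imply 'mouse'?
transactions = [
    {"computer", "mouse"},
    {"computer", "mouse", "printer"},
    {"computer"},
    {"printer"},
    {"mouse"},
]
support, confidence = rule_metrics(transactions, "computer", "mouse")
print(support, confidence)  # 0.4 0.6666666666666666
```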
Example 1. Basket analysis (Association).
This is the typical example used to explain the field of use of data mining (the association between sales of baby diapers and beer). In our case, using the examples provided by MicroStrategy in their learning project, called MicroStrategy Tutorial, we will see an example of using association analysis techniques.
In the example, we analyze the DVD sales of a department store and try to find associations between sales of different movies. That is, we try to find titles that are sold together, with the aim of then establishing trade promotions for those films (e.g., selling packs, placing the movies together in the aisles, discount promotions when buying a second unit, etc.) in order to increase sales. For this type of analysis we use association rule analysis.
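A first approximation to this kind of basket analysis is simply counting which pairs of titles co-occur in the same purchase. The movie titles and the minimum count are made up; MicroStrategy's actual association algorithm is not shown here:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_count=2):
    """Count how often each pair of titles is sold together; pairs at or above
    min_count are candidates for joint promotions."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}

# Toy DVD purchases (titles are invented).
baskets = [
    {"Movie A", "Movie B"},
    {"Movie A", "Movie B", "Movie C"},
    {"Movie B", "Movie C"},
    {"Movie A", "Movie C"},
]
print(frequent_pairs(baskets))
```

Each frequent pair is then a candidate for a pack, shared shelf placement, or a second-unit discount.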
Example 2. Customer segmentation (cluster analysis).
With this analysis we study our customers and, using their demographic information (age, education, number of children, marital status, or household type), perform a market segmentation to prepare the launch of certain products or to make promotional offers.
In this case, we will conduct a cluster analysis using the k-means algorithm, which is the one MicroStrategy supports.
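A minimal pure-Python sketch of k-means on two-dimensional customer points (the age/income values are invented, and this is only the textbook algorithm, not MicroStrategy's implementation):

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Minimal k-means sketch on 2-D points: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                                  + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                     if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious customer groups: young/low income and older/high income (toy data).
points = [(25, 20), (27, 22), (24, 19), (55, 80), (58, 85), (60, 82)]
centroids, clusters = k_means(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

The resulting clusters are the customer segments; each centroid is the "average customer" of its segment.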
Example 3. Sales forecast in a campaign (decision tree).
In this analysis we use a decision tree to determine the response of a particular group of customers to rebates on certain products during the back-to-school season. For this we use binary decision trees (remember that decision trees can be used both for classification and, as in this case, for regression analysis). We try to determine how factors such as age, sex, or number of children influence the probability of buying during the sales campaign.
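The kind of Yes/No tree such an analysis induces can be written out by hand like this; the split rules and attribute names are purely illustrative, not learned from real data:

```python
def predict_purchase(customer):
    """A hand-written binary decision tree of the Yes/No form described above;
    the splits are illustrative, not induced from real data."""
    if customer["children"] == 0:
        return "does not buy"       # no children: unlikely target of back-to-school offers
    if customer["age"] < 45:
        return "buys"
    if customer["sex"] == "F":
        return "buys"
    return "does not buy"

print(predict_purchase({"age": 35, "sex": "M", "children": 2}))  # buys
print(predict_purchase({"age": 50, "sex": "M", "children": 0}))  # does not buy
```

A tree induction algorithm would discover splits of exactly this shape from the training data, rather than having them written by hand.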
In the next blog entry we will go through these examples in detail using MicroStrategy's Data Mining tools.