Data Mining and Statistics for Decision Making

Within this scope, many decision support models are still being developed to assist decision makers and owners of organizations. It is easy for organizations to collect massive amounts of data; the problem is generally how to use these data to achieve economic gains. There is a critical need for specialization and automation to transform the data in big data sets into knowledge.

Data mining techniques are capable of providing description, estimation, prediction, classification, clustering, and association. Recently, many data mining techniques have been developed in order to find hidden patterns and relations in big data sets.

It is important to obtain new correlations, patterns, and trends that are understandable and useful to decision makers. Much research and many applications have focused on different data mining techniques and methodologies. In this study, we aim to obtain understandable and applicable results from a large set of records belonging to a firm active in the meat processing industry, by using data mining techniques.

In the application part, data cleaning and data integration, which are the first steps of the data mining process, are first performed on the data in the database. With the aid of data cleaning and data integration, a data set suitable for data mining was obtained.

For example, when data are collected via automated computerized methods, it is not uncommon for measurements to be recorded for thousands, hundreds of thousands, or more predictors.

The standard analytic methods for predictive data mining, such as neural network analyses, classification and regression trees, generalized linear models, or general linear models, become impractical when the number of predictors exceeds a few hundred. Feature selection selects a subset of predictors from a large list of candidate predictors without assuming that the relationships between the predictors and the dependent or outcome variables of interest are linear, or even monotone.
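As a rough illustration of screening predictors without assuming linear or monotone relationships, the sketch below ranks candidate predictors by a binned correlation ratio (the share of outcome variance explained by equal-width bins of the predictor). This measure, the function names, and the toy data are assumptions chosen for the example; the text does not say which screening statistic is actually used.

```python
from collections import defaultdict
from statistics import mean


def correlation_ratio(xs, ys, bins=4):
    """Share of the variance of ys explained by equal-width bins of xs (0..1)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0  # constant predictor: everything falls in one bin
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[min(int((x - lo) / width), bins - 1)].append(y)
    grand = mean(ys)
    total = sum((y - grand) ** 2 for y in ys)
    if total == 0:
        return 0.0
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return between / total


def select_features(columns, target, k):
    """Return the names of the k predictors with the highest score."""
    scores = {name: correlation_ratio(vals, target) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (invented): the outcome depends on u_shaped in a non-monotone way,
# so a plain linear correlation with the outcome would be zero.
u = list(range(-10, 11))
columns = {
    "u_shaped": u,
    "noise": [(i * 7) % 5 for i in range(21)],
    "constant": [1] * 21,
}
target = [v * v for v in u]
print(select_features(columns, target, k=1))
```

Because the score only compares bin means, it can pick up the U-shaped predictor that a linear screening statistic would miss; the number of bins is a tuning choice.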

Feature selection is therefore used as a pre-processor for predictive data mining, to select manageable sets of predictors that are likely related to the dependent (outcome) variables of interest, for further analyses with any of the other methods for regression and classification.

Machine Learning

Machine learning, computational learning theory, and similar terms are often used in the context of data mining to denote the application of generic model-fitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining and machine learning is usually on the accuracy of prediction (predicted classification), regardless of whether the "models" or techniques used to generate the prediction are interpretable or open to simple explanation.

Good examples of this type of technique often applied to predictive data mining are neural networks or meta-learning techniques such as boosting. These methods usually involve fitting very complex "generic" models that are not related to any reasoning or theoretical understanding of underlying causal processes; instead, these techniques can be shown to generate accurate predictions or classifications in cross-validation samples.

Meta-Learning

The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as Stacking (Stacked Generalization). Each model computes predicted classifications for a cross-validation sample, from which overall goodness-of-fit statistics e.

Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method e.
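A minimal sketch of combining predictions through a meta-learner is given here, assuming three invented base classifiers (simple threshold rules) and a deliberately simple meta-learner: a lookup table that learns, from a hold-out sample, which final label goes with each combination of base predictions. A real project would typically use a trained model as the meta-learner; everything named below is hypothetical.

```python
from collections import Counter, defaultdict


# Hypothetical base classifiers: each maps a record (x1, x2) to a label 0 or 1.
def by_x1(rec):
    return int(rec[0] > 5)


def by_x2(rec):
    return int(rec[1] > 5)


def by_sum(rec):
    return int(rec[0] + rec[1] > 12)


BASE = [by_x1, by_x2, by_sum]


def fit_meta(holdout, labels):
    """Learn the majority true label for each combination of base predictions."""
    table = defaultdict(Counter)
    for rec, y in zip(holdout, labels):
        key = tuple(clf(rec) for clf in BASE)
        table[key][y] += 1
    return {key: votes.most_common(1)[0][0] for key, votes in table.items()}


def predict(meta, rec):
    """Combine base predictions via the meta table; unseen combinations fall back to a majority vote."""
    key = tuple(clf(rec) for clf in BASE)
    if key in meta:
        return meta[key]
    return Counter(key).most_common(1)[0][0]


# Invented hold-out sample: the true label is 1 only when both inputs exceed 5.
holdout = [(8, 8), (8, 2), (2, 8), (2, 2), (9, 6), (7, 3)]
labels = [1, 0, 0, 0, 1, 0]
meta = fit_meta(holdout, labels)
print(predict(meta, (6, 9)), predict(meta, (9, 1)))
```

The meta table is fitted on records that were not used to build the base classifiers, which mirrors the role of the cross-validation sample in the description above.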

The predictions from different classifiers can be used as input into a meta-learner, which will attempt to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.

Models for Data Mining

In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization.

In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements. One such model, CRISP (Cross-Industry Standard Process for data mining), was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining.

This general approach postulates the following (perhaps not particularly controversial) general sequence of steps for data mining projects: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Another approach, the Six Sigma methodology, is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities.

This model has recently become very popular due to its successful implementations in various American industries, and it appears to be gaining favor worldwide. All of these models are concerned with the process of how to integrate data mining methodology into an organization, how to "convert data into information," how to involve important stakeholders, and how to disseminate the information in a form that stakeholders can easily convert into resources for strategic decision making.

Some software tools for data mining are specifically designed and documented to fit into one of these specific frameworks. Data mining can equally well be integrated into ongoing marketing research, CRM (Customer Relationship Management) projects, etc.

Predictive Data Mining

The term Predictive Data Mining usually refers to data mining projects whose goal is to identify a statistical or neural network model, or set of models, that can be used to predict some response of interest.

For example, a credit card company may want to engage in predictive data mining, to derive a trained model or set of models e. Other types of data mining projects may be more exploratory in nature e. Data reduction is another possible objective for data mining e.

Stacked Generalization

See Stacking.

Stacking (Stacked Generalization)

The concept of stacking (Stacked Generalization) applies to the area of predictive data mining, to combine the predictions from multiple models.

In stacking, the predictions from different classifiers are used as input into a meta-learner, which attempts to combine the predictions to create a final best predicted classification. Other methods for combining the prediction from multiple models or methods e.

Text Mining

While Data Mining is typically concerned with the detection of patterns in numeric data, very often important e.

Unlike numeric data, text is often amorphous and difficult to deal with. Text mining generally consists of the analysis of multiple text documents by extracting key phrases, concepts, etc.

Voting

See Bagging.

StatSoft defines data warehousing as a process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes. The most efficient data warehousing architecture will be capable of incorporating or at least referencing all data available in the relevant enterprise-wide information management systems, using designated technology suitable for corporate database management e.

Note that despite its name, analyses referred to as OLAP do not need to be performed truly "on-line" (or in real-time); the term applies to analyses of multidimensional databases (that may, obviously, contain dynamically updated information) through efficient "multidimensional" queries that reference various types of data. OLAP facilities can be integrated into corporate (enterprise-wide) database systems and they allow analysts and managers to monitor the performance of the business e. The final result of OLAP techniques can be very simple e.
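To make the idea of a "multidimensional" query concrete, here is a small sketch that rolls a fact table up along any chosen subset of dimensions. The records, dimension names, and the `roll_up` helper are invented for illustration and do not correspond to any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical fact records: (region, product, quarter, sales)
FACTS = [
    ("North", "A", "Q1", 100),
    ("North", "B", "Q1", 50),
    ("South", "A", "Q1", 70),
    ("North", "A", "Q2", 120),
    ("South", "B", "Q2", 30),
]
DIMS = ("region", "product", "quarter")


def roll_up(facts, by):
    """Sum the sales measure over every combination of the requested dimensions."""
    idx = [DIMS.index(d) for d in by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


print(roll_up(FACTS, ["region"]))
print(roll_up(FACTS, ["product", "quarter"]))
```

Passing an empty list of dimensions gives the grand total, and adding dimensions drills down into finer cells of the same cube.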

Although Data Mining techniques can operate on any kind of unprocessed or even unstructured information, they can also be applied to the data views and summaries generated by OLAP to provide more in-depth and often more multidimensional knowledge. As opposed to traditional hypothesis testing designed to verify a priori hypotheses about relations between variables e. In a typical exploratory data analysis process, many variables are taken into account and compared, using a variety of techniques in the search for systematic patterns.

Computational exploratory data analysis methods include both simple basic statistics and more advanced, designated multivariate exploratory techniques designed to identify patterns in multivariate data sets. Basic statistical exploratory methods. The basic statistical exploratory methods include such techniques as examining distributions of variables e.

Multivariate exploratory techniques. Multivariate exploratory techniques designed specifically to identify patterns in multivariate (or univariate, such as sequences of measurements) data sets include: Cluster Analysis, Factor Analysis, Discriminant Function Analysis, Multidimensional Scaling, Log-linear Analysis, Canonical Correlation, Stepwise Linear and Nonlinear e.

Neural Networks. Neural Networks are analytic techniques modeled after the hypothesized processes of learning in the cognitive system and the neurological functions of the brain, capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data.

A large selection of powerful exploratory data analytic techniques is also offered by graphical data visualization methods that can identify relations, trends, and biases "hidden" in unstructured data sets. Those relations between variables can be visualized by fitted functions e.

For example, one of many applications of the brushing technique is to select i. The exploration of data can only serve as the first stage of data analysis and its results can be treated as tentative at best as long as they are not confirmed, e.

The data mining agent is responsible for creating and manipulating data mining models. It performs data cleansing and data preparation, provides the necessary parameters for data mining algorithms and creates data mining models by executing data mining algorithms.

It assimilates the results from the data mining agent, generates a report based on predefined templates and performs output customization.

Holsheimer introduced the Data Surveyor application system in his papers Holsheimer, ; Holsheimer et al. He did not emphasize the functionalities it offers to the user. Instead, he put emphasis on the implementation of data mining methods and the interaction between Data Surveyor and database systems. The author describes Data Surveyor as a system designed for the discovery of rules.

It is a 3-tier application system providing a customized GUI for two organizational roles: the data analyst and the end user. Heindrichs and Lim have researched the impact of web-based data mining tools and business models on strategic performance capabilities. Their paper treats web-based data mining tools as a synonym for data mining application systems.

Data Mining and Statistics for Decision Making by Stéphane Tufféry

The authors state that the main disadvantage of the data mining software tool approach is that it provides results on request, on static and potentially outdated data. They emphasize the importance of the data mining application system approach, because it provides ease of use and results on real-time data.

The authors also discuss the importance of data mining application systems, arguing that sustaining a competitive advantage demands a combination of three prerequisites: skilled and capable people, an organizational culture focused on learning, and the use of leading-edge information technology tools for effective knowledge management. Data mining application systems undoubtedly contribute to the last of these. We are going to introduce them in the following sections.

The first step within the pre-development activities was platform selection, which was strongly influenced by two important factors. The first factor was the dominant presence of the Oracle platform at our GSM operator. ODM (Oracle Data Mining) has two important components. Before finally accepting Oracle 9i and ODM, an evaluation sub-project was initiated. The aim of the project was to evaluate ODM, i. It was the first version of ODM, and evaluation was necessary to reduce the risk of using an immature product.

The evaluation was performed by recreating data mining models on the domains of some past projects that our research group had performed using data mining software tools Kukar, ; Kukar et al. The models acquired in past projects were compared with the results acquired by ODM, and the evaluation gave positive results, verifying ODM. Another advantage of Oracle 9i is security.

As opposed to many other data mining platforms, with Oracle 9i data mining the data do not leave the database. Data mining models and their rules are stored in the database, which means that database security controls access to data mining data, i. Both activities are extremely interrelated, because the process model implies the functionality of an application to a great extent.

It turned out that DMDSS directly or indirectly needs three roles: a data administrator, a data mining administrator and a business user. A data mining administrator role should be granted only to users with advanced or at least above-average knowledge of data mining methods and concepts. Business users are business analysts responsible for performing analysis in various business areas.

The roles of the data mining administrator and the business user must be supported. The data administrator role and the data preparation phase will not be supported by DMDSS; they should be supported by other tools. This would prevent business users from using functionalities which demand advanced knowledge of data mining. The data mining administrator should be able to create, evaluate and delete models. A set of model statuses should be defined so that the administrator can make only good and useful models available to business users.

The data mining administrator should be able to comment on the models and store the comments in the database. Business users should be able to see these comments, which would help them understand and interpret the models better. Various visualization and representation techniques should be available, so that models can be presented to business users in several ways. Before the training of business users, a data mining tutorial should be organized where they can learn the concepts of data mining, enabling them to use and truly exploit DMDSS.

The key to the success of DMDSS is to define its functionalities in a way that enables the data mining administrator to create and evaluate models. On the other hand, business users should be able to use it effectively with as little data mining knowledge as possible. The key pre-development activity was to determine a data mining process model for DMDSS that would be appropriate for analysts in the marketing department of our GSM operator. Given their level of knowledge of data mining concepts, it was obvious that the DMDSS process model should enable analysts to incorporate it into their decision process.

The analysis of CRISP-DM and the other previously introduced data mining process models revealed that they are more appropriate for ad-hoc projects and a data mining software tool approach than for a data mining application system approach. As a consequence, none of them could be used directly for DMDSS and the data mining application system approach.

The first stage was the execution of the business understanding phase, where the aim was to discover the domains with a continual need for repeated analysis based on data mining methods. They are referred to as the areas of analysis. The second stage was the execution of a data mining project for each area of analysis using a data mining software tool approach.

The development process is introduced later on in the chapter. The aim of executing multiple iterations of all CRISP-DM phases for every project was to improve data preparation and to fine-tune the data mining algorithms used in the ODM API by finding proper parameter values for them.

Data sets were re-created automatically every night, based on the current state of the data warehouse and transactional databases. After the re-creation of data sets, data mining models were created and evaluated. It was essential to iterate over a longer period of time in order to implement automated procedures for data preparation and to monitor the level of change in the data sets and the data mining models acquired. One of the demands for DMDSS was the ability to create models daily for every area of analysis, and for that reason the degree of change in the data sets and data mining models was monitored.
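The text does not say how the degree of change between consecutive models was measured. One simple possibility, shown purely as an assumption, is to compare the rule sets of two nightly models with a Jaccard distance; the rule strings and the function name below are invented.

```python
def rule_set_change(previous_rules, current_rules):
    """Jaccard distance between two rule sets: 0.0 means identical, 1.0 means disjoint."""
    prev, curr = set(previous_rules), set(current_rules)
    union = prev | curr
    if not union:
        return 0.0
    return 1.0 - len(prev & curr) / len(union)


# Invented rule texts from two consecutive nightly models.
monday = {
    "IF monthly_calls > 100 THEN good",
    "IF late_payments >= 3 THEN bad",
    "IF contract_months < 6 THEN average",
}
tuesday = {
    "IF late_payments >= 3 THEN bad",
    "IF contract_months < 6 THEN average",
    "IF monthly_calls > 120 THEN good",
}
print(rule_set_change(monday, tuesday))
```

A value near zero over many nights would suggest that data preparation and parameter settings have stabilized; exact string matching is a crude choice, since a small threshold shift counts as a completely new rule.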

The multiple iterations performed in the second stage ensure the stability of the data preparation phase and proper parameter value sets for the data mining algorithms in the modelling phase. Modelling and evaluation are performed by the data mining administrator, and deployment by business users. DMDSS was developed using several diagramming techniques. Entity relationship diagrams were used for data modelling. For several reasons we decided to use an iterative incremental process model. As already mentioned, one of the reasons for multiple iterations was to achieve improvements and stability in the areas of data preparation and modelling.

After each iteration had been finished, functional testing was performed. Only then was the analysis of functionalities conducted and, based on that, the list of changes and improvements drawn up. This list was used as the list of demands for the next iteration of development. Such a division of the development team was efficient, because the development process could be carried out consequently to a certain extent and developers could be grouped according to their areas of specialization and skills.

Then we are going to introduce the concepts of use and the functionalities of DMDSS through some example forms for the data mining administrator and the business user. These are introduced for the classification data mining method supported by DMDSS.

DMDSS supports role-aware menus. Every role has its own menu, which enables access only to its dedicated modules. Every DMDSS user is granted one of the following roles: data mining administrator, business user or developer. The last one was introduced for administrative and maintenance purposes.
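A role-aware menu can be pictured as a lookup from role to permitted modules. The sketch below is only an illustration: the module names are invented (loosely following the duties the chapter assigns to each role) and say nothing about how DMDSS itself is implemented.

```python
# Hypothetical module names per role; not taken from the DMDSS code base.
MODULES_BY_ROLE = {
    "data_mining_administrator": ["create_model", "evaluate_model", "delete_model", "view_model"],
    "business_user": ["view_published_model"],
    "developer": ["maintain_catalogue"],
}


def build_menu(role):
    """Return the menu entries for a role; unknown roles are rejected."""
    if role not in MODULES_BY_ROLE:
        raise ValueError(f"unknown role: {role}")
    return list(MODULES_BY_ROLE[role])


print(build_menu("business_user"))
```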

DMDSS allows the developer to maintain the catalogue of areas of analysis. The catalogue is a group of database tables with the following advantages:

The lists of areas of analysis are built dynamically, based on the current catalogue contents. This is used in the building of menus and lists of values.

The name of the training set, the attribute names and the name of the classification attribute (only for classification) are stored in the database.

The translations of keywords used in models are stored in the database. By translating the keywords (if, then, in, …), the model presentation can be adapted and changed without changing the DMDSS program code. To achieve higher flexibility, every area of analysis has its own keyword translations.

These advantages clearly reveal the flexibility of DMDSS: new areas of analysis can be introduced without changing the program code.
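The catalogue idea can be sketched with an in-memory dictionary standing in for the database tables. The area name, table name, attribute names, and translations are all invented for the example; only the structure (per-area metadata plus per-area keyword translations) follows the description above.

```python
# In-memory stand-in for the catalogue tables; every value here is hypothetical.
CATALOGUE = {
    "customer_classification": {
        "training_set": "CUSTOMER_TRAIN",
        "attributes": ["monthly_calls", "contract_months"],
        "class_attribute": "category",
        "keywords": {"IF": "When", "THEN": "then", "IN": "is one of", "AND": "and"},
    },
}


def list_areas():
    """Areas of analysis are read from the catalogue, not hard-coded in menus."""
    return sorted(CATALOGUE)


def present_rule(area, tokens):
    """Render a tokenized rule using the keyword translations of the given area."""
    keywords = CATALOGUE[area]["keywords"]
    return " ".join(keywords.get(tok, tok) for tok in tokens)


rule = ["IF", "monthly_calls", "IN", "(100, 200)", "THEN", "category", "=", "good"]
print(present_rule("customer_classification", rule))
```

Adding a new area of analysis, or changing how rules are worded for one area, then only means adding or editing a catalogue entry.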

Storing the catalogue of the areas of analysis in the database ensures an efficient maintenance process. All these parameters are also stored in the database. The information support provided for the roles of data mining administrator and business user is presented in the following part of the chapter. The classification method was chosen as the test area for the GUI concepts and the use of DMDSS, and the first four development iterations were dedicated only to classification.

For the purposes of this area of analysis, customers are ranked into three categories: good, average and bad customers. The aim of the area of analysis is to acquire a customer model for each customer category. This information enables business users to monitor the characteristics of a particular customer category and plan better marketing campaigns for acquiring new customers. Within the DMDSS application, additional areas of analysis for mobile phone sales analysis, customer analysis and vendor analysis were also investigated.

The data mining administrator can create classification models by using the model creation form (Figure 1). When creating the model, they input a unique model name and the purpose of model creation. In addition, four algorithm parameters must be set before the model is created. The user can choose the value for each parameter from the interval that was defined as proper in the second stage of the process model.

At the bottom of the form there are recommended values for parameters to acquire a model with fewer or more rules: default settings for fewer rules in a model, and settings for more rules in a model. Model testing is performed automatically as the last phase of the model creation.
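The form logic described above (four parameters, each restricted to a predefined interval, with recommended presets) could be checked along the following lines. The source does not name the four parameters or give their intervals, so the names, ranges, and presets below are placeholders invented for the sketch.

```python
# Placeholder parameter names and intervals; the chapter does not specify them.
ALLOWED = {
    "max_depth": (2, 10),
    "min_records": (10, 500),
    "max_predictors": (1, 25),
    "max_build_minutes": (1, 60),
}
PRESET_FEWER_RULES = {"max_depth": 3, "min_records": 200, "max_predictors": 5, "max_build_minutes": 10}
PRESET_MORE_RULES = {"max_depth": 8, "min_records": 20, "max_predictors": 20, "max_build_minutes": 30}


def validate(params):
    """Return a list of problems; an empty list means the settings are acceptable."""
    errors = []
    for name, (low, high) in ALLOWED.items():
        if name not in params:
            errors.append(f"{name}: missing")
        elif not low <= params[name] <= high:
            errors.append(f"{name}: {params[name]} is outside [{low}, {high}]")
    return errors


print(validate(PRESET_FEWER_RULES))
print(validate({**PRESET_MORE_RULES, "max_depth": 50}))
```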

Model testing is an evaluation process that assesses the quality of the model using machine learning methods. After the model creation, a data mining administrator can view and inspect the model. Model viewing is supported by two visualization techniques. The first is classification rules. As already mentioned, keywords used in rules are translated in order to present the rules in a language more appropriate for the users.

The second technique is decision trees, where classification rules are converted into decision trees showing the same information as the rules. Because the decision tree is a graphical technique that presents the rules visually, it is very appropriate. While viewing and inspecting, the administrator can input comments for the model. As already mentioned, the role of the comments is to help business users understand and interpret the models better. A data mining administrator can change the status of a model to published if the model quality reaches a certain level and the model differs from the previously created model for the particular area of analysis.

Business users can view only models with published status. Business users have access to fewer functionalities than the data mining administrator. The model viewing form for a business user (Figure 2) is slightly different from the one for the data mining administrator, but has similar general characteristics. Like the data mining administrator, business users can view rules using both visualization techniques. In addition, the form gives access to some general information about the model: creation date, purpose of model creation, etc.
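The publishing rule and the visibility restriction can be summarized in a few lines. This is a sketch under assumptions: the quality threshold, the status names, and the way two models are compared (by their rule sets) are not specified in the text and are invented here.

```python
from dataclasses import dataclass

QUALITY_THRESHOLD = 0.8  # hypothetical; the chapter only says "a certain level"


@dataclass
class Model:
    name: str
    area: str
    quality: float
    rules: frozenset
    status: str = "created"


def can_publish(model, previous=None):
    """Publishable if quality is sufficient and the model differs from the previous one."""
    good_enough = model.quality >= QUALITY_THRESHOLD
    differs = previous is None or model.rules != previous.rules
    return good_enough and differs


def visible_to_business_user(models):
    """Business users see published models only."""
    return [m for m in models if m.status == "published"]


old = Model("m1", "customers", 0.85, frozenset({"r1", "r2"}), status="published")
same = Model("m2", "customers", 0.90, frozenset({"r1", "r2"}))
new = Model("m3", "customers", 0.90, frozenset({"r1", "r3"}))
if can_publish(new, old):
    new.status = "published"
print([m.name for m in visible_to_business_user([old, same, new])])
```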

The form also enables business users to view the comments on a model written by the data mining administrator. DMDSS has now been in production for several months. During the first year of production, supervision and consultancy will be provided by the development team. Supervision and consultancy have the following goals. The role of data mining administrator will be supervised by a data mining consultant from the development team with expertise and experience in data mining.

The employee responsible for that role has enough knowledge, but not yet enough experience. Supervision will mainly cover support in model evaluation and model interpretation for the data mining administrator and business users. Business users use the patterns and rules identified in models as new knowledge, which they use for analysis and in the decision process at their work. According to them, they have already become aware of the advantages of continual use of data mining for analysis purposes.

Based on the models acquired, they have already prepared some changes in the marketing approach and are planning a campaign focused on a special customer group, based on the knowledge acquired from data mining models. The most important achievement after several months of usage is that business users have really started to understand the potential of data mining. Suddenly they have many new ideas for new areas of analysis. The list of new areas of analysis will be made in several months, and after that it will be discussed and evaluated.

Selected areas of analysis will then be implemented and introduced to DMDSS according to the methodology introduced in the chapter. Experience with the use of DMDSS has also revealed that business users need to be able to keep their own archive of classification rules. They also need an option to add their own comments to archived rules in order to record the ideas implied and gained by the rules.

The future plan for classification model utilization is also to apply the model to new customers in order to predict the category a new customer potentially belongs to. These enhancements are planned for future implementation. While designing and developing DMDSS and monitoring its use by business users, we have been considering and exploring the semantic contribution that a data mining application system like DMDSS makes to a decision process and to any kind of business analysis. For that reason, one of the goals of our project was to illustrate the semantic contribution of the use of DMDSS in decision processes.

We decided to use the concept of a meta-model for that purpose. A meta-model shows domain concepts and the relations between them. In this case the meta-model describes a decision process on the conceptual level, with emphasis on demonstrating the contribution and role of DMDSS as a data mining application system (Figure 3). UML class diagrams were used as the technique for the meta-model. Decision support concepts are represented as classes and relations between them are represented as associations and aggregations.

Concepts and relations which, in our opinion, represent a contribution of the use of DMDSS to the decision process are drawn in a dotted line style. The meta-model shows various concepts that influence the decision process and represent a basis for a decision. Information technology engineers often believe that decisions mostly depend on data from OLAP systems and other information acquired from information systems. It is true that these represent a very important basis for a decision, although in more than a few cases decisions mostly depend on factors like intuition and experience (Bohanec). Knowledge is, in our opinion, probably the most important basis for a decision, because it enables the correct interpretation of data, i.

The use of DMDSS, and of the models and rules it creates, contributes to the accumulation of knowledge acquired from models and their rules. A detailed description of the decision process and the creation of a detailed meta-model are beyond the scope of the chapter. DMDSS is a data mining application system which enables decision support based on the knowledge acquired from data mining models and their rules.

The mission of DMDSS is to offer an easy-to-use tool that enables business users to exploit data mining with only a basic understanding of data mining concepts. DMDSS enables the integration of data mining into daily business processes and decision processes by supporting several areas of analysis. The DMDSS process model divides the traditional data mining expert role into a data mining administrator role and a data mining consultant role.