Australian software development project.² We will first relate their reported process, and then compare this with the CRISP and SEMMA frameworks.

Both aid the knowledge discovery process. Once models are obtained and tested, they can then be deployed to gain value with respect to business or research application.

The project owner was an international telecommunication company which undertook over 50 software projects annually. Processes were organized for Software Configuration Management, Software Risk Management, Software Project Metric Reporting, and Software Problem Report Management. Nayak and Qiu were interested in mining the data from the Software Problem Reports. All problem reports were collected throughout the company (over 40,000 reports). For each report, the available data included the attributes shown in Table 2.1.

Table 2.1. Selected attributes from problem reports

Attribute         Description
Synopsis          Main issues
Responsibility    Individuals assigned
Confidentiality   Yes or no
Environment       Windows, Unix, etc.
Release note      Fixing comment
Audit trail       Process progress
Arrival date
Close date
Severity          Text describing the bug and impact on system
Priority          High, Medium, Low
State             Open, Active, Analysed, Suspended, Closed, Resolved, Feedback
Class             Sw-bug, Doc-bug, Change-request, Support, Mistaken, Duplicate

² R. Nayak, Tian Qiu (2005). A data mining application: Analysis of problems occurring during a software project development process, International Journal of Software Engineering 15:4, 647–663.

Data Field Selection: Some of the data was not pertinent to the data mining exercise, and was ignored. Of the variables given in Table 2.1, Confidentiality, Environment, Release note, and Audit trail were ignored as having no data mining value. They were, however, used during preprocessing and post-processing to aid in data selection and in gaining a better understanding of the rules generated. For data stability, only problem reports with a State value of Closed were selected.

The data mining process reported included goal definition, data preprocessing, data modeling, and analysis of results.

1. Goal Definition

Data mining was expected to be useful in two areas. The first involved the early estimation and planning stage of a software project, in which company engineers have to estimate the number of lines of code, the kinds of documents to be delivered, and the time required. Accuracy at this stage would vastly improve project selection decisions. Little tool support was available for these activities, and estimates of these three attributes were based on experience supported by statistics on past projects. Thus projects involving new types of work were difficult to estimate with confidence. The second area of data mining application concerned the data collection system, which had limited information retrieval capability. Data was stored in flat files, and it was difficult to gather information related to specific issues.

2. Data Pre-Processing

This step consisted of attribute selection, data cleaning, and data transformation. Whenever a problem report was created, the project leader had to determine how long the fix took, how many people were involved, customer impact severity, impact on cost and schedule, and type of problem (software bug or design flaw). Thus the attributes listed below were selected as most important:

• Severity
• Priority
• Class
• Arrival-Date
• Close-Date
• Responsible
• Synopsis

The first five attributes had fixed values, and the Responsible attribute was converted to a count of those assigned to the problem. All of these attributes could be dealt with through conventional data mining tools. Synopsis was text data requiring text mining.
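The selection steps described above (keeping only Closed reports, dropping the attributes judged to have no mining value, and converting Responsible to a head count) can be sketched in Python. This is a minimal illustration, not the procedure Nayak and Qiu actually coded: the dictionary layout, the use of the Table 2.1 names as keys, the made-up sample reports, and the assumption that the responsibility field holds a comma-separated list of names are all hypothetical.

```python
# Attributes kept for mining. Key names follow Table 2.1 (hypothetical layout).
KEPT = ("Severity", "Priority", "Class", "Arrival date", "Close date", "Synopsis")


def select_for_mining(reports):
    """Keep Closed reports only, with the selected attributes and a people count."""
    rows = []
    for report in reports:
        if report.get("State") != "Closed":
            continue  # only closed reports are treated as stable
        row = {name: report.get(name) for name in KEPT}
        # Assumption: assignees are stored as a comma-separated string.
        people = [p.strip() for p in report.get("Responsibility", "").split(",")]
        row["Responsible"] = len([p for p in people if p])
        rows.append(row)
    return rows


# Made-up sample reports, for illustration only.
sample_reports = [
    {"Synopsis": "Crash on login", "Responsibility": "A. Lee, B. Kim",
     "Confidentiality": "no", "Environment": "Unix",
     "Arrival date": "1998-03-02", "Close date": "1998-03-09",
     "Severity": "non-critical", "Priority": "Medium",
     "State": "Closed", "Class": "Sw-bug"},
    {"Synopsis": "Manual unclear", "Responsibility": "C. Wu",
     "Confidentiality": "no", "Environment": "Windows",
     "Arrival date": "1998-04-01", "Close date": "",
     "Severity": "non-critical", "Priority": "Low",
     "State": "Open", "Class": "Doc-bug"},
]

mining_rows = select_for_mining(sample_reports)
print(mining_rows)
```

The ignored fields (Confidentiality, Environment, and so on) stay in the raw reports, so they remain available for interpreting rules afterwards, as the text notes.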
Class was selected as the target attribute, with the possible outcomes given in Table 2.2:

Table 2.2. Class outcomes

Sw-bug           Bug from software code implementation
Doc-bug          Bug from documents directly related to the software product
Change-request   Customer enhancement request
Support          Bug from tools or documents, not the software product itself
Mistaken         Error in either software or document
Duplicate        Problem already covered in another problem report

Data Cleaning: Cleaning involved identification of missing, inconsistent, or mistaken values. Tools used in this process step included graphical tools to provide a picture of distributions, and statistics such as maxima, minima, mean values, and skew. Some entries were clearly invalid, caused by either human error or the evolution of the problem reporting system. For instance, over time, input for the Class attribute changed from SW-bug to sw-bug. Those errors that were correctable were corrected. If not all of the errors detected for a report could be corrected, that report was discarded from the study.

Data Transformation: The Arrival-Date and Close-Date attributes were used to calculate the duration of each problem. This required additional information, including time zone. The Responsible attribute identified how many people were involved. An attribute Time-to-fix was created by multiplying the duration by the number of people, and was then categorized into discrete values of 1 day, 3 days, 7 days, 14 days, 30 days, 90 days, 180 days, and 360 days (representing over one person-year).

In this application, 11,000 of the original 40,000 problem reports were left. They came from over 120 projects completed over the period 1996–2000. Four attributes were obtained:

• Time-to-fix
• Class
• Severity
• Priority

Text mining was applied to 11,364 records, of which 364 had no time values, so 11,000 were used for conventional data mining classification.

3. Data Modeling

Data mining provides functionality not provided by general database query techniques, which cannot deal with large numbers of records with high-dimensional structures. Data mining provided useful functionality to answer questions such as the type of project documents requiring a great deal of development team time for bug repair, or the impact of various attribute values of synopsis, severity, priority, and class. A number of data mining tools were used.

• Prediction modeling was useful for evaluation of time consumption, giving sounder estimates for project estimation and planning.
• Link analysis was useful in discovering associations between attribute values.
• Text mining was useful in analyzing the Synopsis field.

The data mining software CBA was used for both classification and association rule analysis, C5 for classification, and TextAnalyst for text mining. An example classification rule was:

IF Severity non-critical AND Priority medium
THEN Class is Document with 70.72% confidence with support value of 6.5%

There were 352 problem reports in the training data set having these conditions, but only 256 satisfied the rule's conclusion. Another rule including time-to-fix was more stringent:

IF 21 ≤ time-to-fix ≤ 108 AND Severity non-critical AND Priority medium
THEN Class is Document with 82.70% confidence with support value of 2.7%

There were 185 problem reports in the training data set with these conditions, 153 of which satisfied the rule's conclusion.

4. Analysis of Results

Classification and Association Rule Mining: Data was stratified using choice-based sampling rather than random sampling. This provided an equal number of samples for each value of the target attribute, which improved the probability of obtaining rules for groups with small value counts (thus balancing the data). Three different training sets of varying size were generated. The first data set included 1,224 problem reports from one software project.
The second data set consisted of equally distributed values from 3,400 problem reports selected from all software projects. The third data set consisted of 5,381 problem reports selected from all projects.

Minimum support and confidence were used to control rule modeling. Minimum support is a constraint requiring that at least the stated number of cases be present in the training set. A high minimum support will yield fewer rules. Confidence is the strength of a rule as measured by the correct classification of cases. In practice, these are difficult to set ahead of analysis, and thus combinations of minimum support and confidence were used.

In this application, it was difficult for the CBA software to obtain correct classification on test data above 50%. The use of equal density of cases was not found to yield more accurate models in this study, although it appears a rational approach for further investigation. Using multiple support levels was also not found to improve error rates, and single support mining yielded a smaller number of rules. However, useful rules were obtained.

C5 was also applied for classification mining. C5 used cross validation, which splits the dataset into subsets (folds), treating each fold in turn as the test set and the remaining folds as the training set, in hopes of finding a better result than a single training set process. C5 also has a boosting option, which generates and combines multiple classifiers in an effort to improve predictive accuracy. Here C5 yielded larger rule sets, with slightly better fits to the training data, although at roughly the same level. Cross validation and boosting would not yield additional rules, but would focus on more accurate rules.
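The support and confidence measures used to control rule generation can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the CBA implementation: the function names, the toy records, and the threshold values are hypothetical, and support is assumed here to mean the fraction of all training cases that satisfy both the conditions and the conclusion of a rule. The only figures taken from the text are the counts quoted for the time-to-fix rule (153 of 185 matching reports).

```python
def rule_stats(records, conditions, conclusion):
    """Return (support, confidence) for the rule IF conditions THEN conclusion.

    Assumed definitions: support is the share of all records satisfying both
    conditions and conclusion; confidence is the share of condition-matching
    records that also satisfy the conclusion.
    """
    matching = [r for r in records
                if all(r.get(k) == v for k, v in conditions.items())]
    correct = [r for r in matching
               if all(r.get(k) == v for k, v in conclusion.items())]
    support = len(correct) / len(records) if records else 0.0
    confidence = len(correct) / len(matching) if matching else 0.0
    return support, confidence


def passes(support, confidence, min_support, min_confidence):
    """A rule is kept only if it meets both minimum thresholds."""
    return support >= min_support and confidence >= min_confidence


# Toy training set (made up): 4 reports match the conditions, 3 are Doc-bug.
match = {"Severity": "non-critical", "Priority": "medium"}
toy_records = (
    [dict(match, Class="Doc-bug")] * 3
    + [dict(match, Class="Sw-bug")]
    + [{"Severity": "critical", "Priority": "high", "Class": "Sw-bug"}] * 6
)

support, confidence = rule_stats(toy_records, match, {"Class": "Doc-bug"})
print(f"support={support:.2f} confidence={confidence:.2f}")

# Confidence recomputed from the counts quoted for the time-to-fix rule.
quoted_confidence = 153 / 185
print(f"quoted rule confidence: {quoted_confidence:.2%}")

# Raising the minimum support removes the toy rule, giving fewer rules.
kept_low = passes(support, confidence, min_support=0.05, min_confidence=0.7)
kept_high = passes(support, confidence, min_support=0.50, min_confidence=0.7)
```

Trying several threshold pairs, as in the last two lines, mirrors the practice described above of using combinations of minimum support and confidence when suitable values cannot be set in advance.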