Data Mining - Huawei FusionInsight

Data Exploration
Function Feature

Various data formats

The SmartMiner can import data from the local file system, remote file system, HDFS file system, and database (JDBC import).

Concurrent data analysis

The mining algorithms used by the SmartMiner support concurrent computing and can run on Spark and Hadoop.

Application modeling automation

Automatic modeling in wizard mode supports quick creation and update of models, which simplifies modeling and reduces modeling time.

Real-time model application

The SmartMiner supports model import and export, and allows analysis and mining models to be created based on existing models. Model application time is shortened to seconds.

Functional View
Function
  • File Management
    • Data import
    • The SmartMiner supports the import of .txt and .csv files. Users can import a file to the SmartMiner from the local host, HDFS, or a database.

      • Function
      • Description
      • TextImport node
      • The TextImport node reads and parses formatted files. A formatted file contains a fixed number of fields separated by separators, and the number of characters in a field can vary. The TextImport node reads data from a text file record by record.
      • ImportFeatureLibrary node
      • The ImportFeatureLibrary node combines corresponding fields in feature files based on a specified target field and multiple specified feature fields to generate sample data required for data mining. Files involved in field combination must have the same primary key, for example, user ID. Invalid data is filtered out during field combination.
      • FolderImport node
      • The FolderImport node imports folders and displays folder and text information.
      • ImportDatabase node
      • The ImportDatabase node extracts data from a single table or view in a database.

      Data import
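The field combination performed by the ImportFeatureLibrary node can be pictured as a join on a shared primary key. The following is a minimal sketch only, not the SmartMiner implementation: the function name `combine_feature_files`, the sample field names, and the rule that a record is "invalid" when a required field is missing or empty are all assumptions made for illustration.

```python
import csv
import io


def combine_feature_files(files, key, target, features):
    """Join rows from several CSV texts on a shared primary key.

    files: list of CSV strings, each with a header row that contains `key`.
    Returns one dict per key value for which the target field and every
    feature field are present and non-empty (assumed validity rule).
    """
    merged = {}
    for text in files:
        for row in csv.DictReader(io.StringIO(text)):
            if not row.get(key):
                continue  # a row without a primary key cannot be joined
            merged.setdefault(row[key], {}).update(row)

    wanted = [key, target] + list(features)
    samples = []
    for record in merged.values():
        # Drop records in which any required field is missing or empty.
        if all(record.get(name) not in (None, "") for name in wanted):
            samples.append({name: record[name] for name in wanted})
    return samples


# Hypothetical feature files that share the primary key user_id.
profile = "user_id,age\n1,34\n2,\n3,51\n"
usage = "user_id,calls,churned\n1,120,no\n2,80,yes\n3,15,yes\n"
samples = combine_feature_files([profile, usage], "user_id", "churned", ["age", "calls"])
print(samples)
```

In this example, user 2 is dropped because its age field is empty, so only users 1 and 3 reach the sample data.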

    • Data preprocessing
    • Data preprocessing includes the data type and value range definition, input and output configuration, and data binning, partitioning, sampling, and filtering, which improves data quality and helps ensure data analysis accuracy.

      • Function
      • Description
      • Type node
      • The Type node specifies the data role, direction, and default value for each field in a data set, and verifies field types.
      • Binning node
      • The Binning node divides the attribute value range of fields of the Range role into segments and assigns a value to each segment, which reduces the number of attribute values. The Binning node can create a field of the Set role based on one or more values of range segments. For example, the node can change the customer income range into a set of income groups or a set of differences from the average income.
      • Partition node
      • The Partition node generates partition fields. It partitions data into subsets and samples for the training and test phases in the modeling process. During the modeling process, one sample is used to generate a model and another sample is used to test the model. In this way, the system can check the forecast accuracy deviation of the model on large data sets similar to the data samples. The Partition node generates fields of the Sign role, and these fields can be specified as partition fields on the Type node.
      • Filter node
      • The Filter node filters data based on the correlation between analysis fields and forecast fields, or by specified fields. Correlation filtering is a process of filtering fields based on the Error Decrease Rate of the analysis fields and forecast fields. The system automatically reserves input fields whose Error Decrease Rate is greater than the threshold, up to the maximum number of fields to be reserved. This node is generally used to filter data records with many fields. Users can also manually filter fields. The Filter node functions the same as the filtering function on the TextImport node.
      • FeatureSelection node
      • The FeatureSelection node filters out invalid or indistinct attributes based on filter criteria.
      • Sampling node
      • The Sampling node supports the following sampling modes:
        − Random: The Sampling node extracts data at a specified ratio. For example, if a user has 10 million records and the sampling ratio is 0.5, the node will extract 5 million records.
        − Equidistant: The Sampling node extracts a record from every N records. For example, if a user has 10 thousand records, N is 10, and the maximum sample size is 100, the Sampling node will extract 100 records.
        − Cluster: The Sampling node extracts records from a group with a specified field at a specified ratio. For example, if the sampling field is school and the sampling ratio is 0.5, the Sampling node will extract 50% of the records from the school group. Multiple sampling fields can be set, and the node will extract records across the specified groups.
        − Stratified: The Sampling node extracts records at a specified ratio from each sampling group specified by a sampling field.
        − Balanced: The Sampling node balances discrete fields. This kind of sampling is sampling with replacement. In sampling with replacement, extracted data is put back to the samples for the next extraction, which balances the value types in the final result.

      Data preprocessing
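The random, equidistant, and stratified sampling modes described for the Sampling node can be sketched in a few lines. This is an illustration under assumptions, not the node's actual implementation: the function names, the `seed` parameter, and the choice to take the first record of every N records are hypothetical.

```python
import random
from collections import defaultdict


def random_sample(records, ratio, seed=0):
    """Keep len(records) * ratio records, chosen at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, round(len(records) * ratio))


def equidistant_sample(records, n, max_size):
    """Take one record from every n records, keeping at most max_size records."""
    return records[::n][:max_size]


def stratified_sample(records, field, ratio, seed=0):
    """Sample the same ratio from each group defined by the value of `field`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, round(len(members) * ratio)))
    return sample


records = list(range(10_000))
half = random_sample(records, 0.5)            # 5,000 of 10,000 records
every_tenth = equidistant_sample(records, 10, 100)  # capped at 100 records

students = [{"school": "A"}] * 6 + [{"school": "B"}] * 4
per_school = stratified_sample(students, "school", 0.5)  # 3 from A, 2 from B
```

The equidistant call mirrors the example in the table: 10 thousand records with N set to 10 would yield 1,000 candidates, and the maximum sample size limits the result to 100.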

    • Data analysis
    • A data analysis node collects statistics on KPIs related to a single attribute or multiple attributes.

      • Function
      • Description
      • Feature analysis of a single attribute
      • Feature analysis of a single attribute refers to collecting counters related to an attribute, including: the total number of records, average value, maximum value, minimum value, sum, range, deviation, standard deviation, S.E. mean, skewness, skewness standard deviation, kurtosis, kurtosis standard deviation, and outliers.
      • Feature analysis of multiple attributes
      • Correlation analysis of multiple attributes provides qualitative analysis counters, including chi-square test results, T test results, and F test results, and quantitative statistics counters, including Kruskal Tau, Pearson, Eta, MAE, MSE, RMSE, and Spearman.

      Data analysis
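Several of the single-attribute counters listed above can be computed with the Python standard library. The sketch below is illustrative only. The source does not state which formulas the SmartMiner uses, so the sample standard deviation (n - 1) and the moment-based definitions of skewness and excess kurtosis are assumptions, and the function name `describe` is hypothetical.

```python
import math
import statistics


def describe(values):
    """Summary counters for one numeric attribute.

    Requires at least two values that are not all identical.
    """
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    # Central moments used for the moment-based skewness and kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "count": n,
        "mean": mean,
        "max": max(values),
        "min": min(values),
        "sum": sum(values),
        "range": max(values) - min(values),
        "std": std,
        "se_mean": std / math.sqrt(n),   # standard error of the mean
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,    # excess kurtosis
    }


stats = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(stats)
```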

  • Model Management
    • Model Import
    • The SmartMiner allows users to create analysis themes and import multiple data mining models with the same objective to each theme.

      During analysis model import, users need to specify a model file as well as a process file and an evaluation file that are associated with the model.

    • Model Cascading
    • To build a new model after an existing model has been applied, you do not have to import the source data again. Instead, use an Application node (TextClassifyApply node excluded) as the source node in the new modeling process. The Application node can be followed by a Type, DataAudit, Statistics, GraphVisualize, Correlate, or Filter node.

      Model Cascading

    • Model evaluation
    • The SmartMiner provides visualized pages for model application evaluation. Each algorithm has its own evaluation KPIs. For example, the classification evaluation KPIs in the process include True Positive Rate (TPR), False Positive Rate (FPR), F value, precision, and AUC.

      ClassifyEvaluation

      The ClassifyEvaluation node is used to evaluate classification accuracy.

      ClassifyEvaluation
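The classification KPIs named above follow from the four confusion-matrix counts. The sketch below shows the standard definitions of TPR, FPR, and precision. It assumes that "F value" refers to the F1 score, which the source does not confirm. AUC is omitted because it needs predicted scores rather than labels, and the function name is hypothetical.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute TPR, FPR, precision, and F1 for a binary classification."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)

    tpr = tp / (tp + fn) if tp + fn else 0.0        # also called recall
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"tpr": tpr, "fpr": fpr, "precision": precision, "f1": f1}


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

For this toy data there are 3 true positives, 1 false negative, 1 false positive, and 3 true negatives.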

      RecommenderEvaluation

      The RecommenderEvaluation node is used to evaluate the rating counters, classification counters, coverage, accuracy, variety, and novelty of personalized recommendation lists.

      RecommenderEvaluation

      NumericalEvaluation

      The NumericalEvaluation node is used to evaluate the accuracy of numerical value forecasts.

      NumericalEvaluation

      ClusterEvaluation

      The ClusterEvaluation node is used to evaluate clustering accuracy.

      ClusterEvaluation

  • Flow orchestration
    • Manual Modeling
    • The process is manually configured by dragging nodes, and parameters on each node are manually modified to adjust model accuracy. This process is applicable to professionals who are familiar with input data and algorithms.

      Model Cascading

    • Automatic Modeling
    • The system instructs users to configure the input data file and select the modeling algorithm and evaluation parameters in wizard mode. Then the system automatically performs the following operations:

      Automatic preprocessing

      − Automatically clean the original data to improve data quality.

      − Automatically identify data types based on the data distribution.

      − Automatically process abnormal and missing data and standardize data based on the data distribution and exception check.

      Automatic parameter selection

      − Automatically filter data features based on KPIs including the AUC and TAU.

      − Use gradual random search: the system automatically analyzes the search results to obtain the optimal value range of each parameter and gradually narrows down the search range.

      Automatic algorithm selection

      Automatically select basic, intermediate, or advanced algorithms based on user input and historical modeling experience. No manual configuration is required.

      Automatic modeling

      − Automatically select matching algorithms based on the forecast objective.

      − Automatically implement model enhancement and optimization based on forecast feedback.

      Automatic model evaluation

      − Automatically select evaluation KPIs based on the algorithm type.

      − Automatically select the optimal model based on the evaluation KPIs.
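The idea behind gradual random search, sampling parameter values at random and then narrowing the range around the best result, can be sketched for a single numeric parameter. This is not the SmartMiner algorithm, whose details are not given in this document: the function name, the `shrink` factor, the round and trial counts, and the toy scoring function are all assumptions.

```python
import random


def gradual_random_search(score, low, high, rounds=5, trials=20, shrink=0.5, seed=0):
    """Randomly sample a parameter, then narrow the range around the best value."""
    rng = random.Random(seed)
    best_x, best_score = None, float("-inf")
    for _ in range(rounds):
        for _ in range(trials):
            x = rng.uniform(low, high)
            s = score(x)
            if s > best_score:
                best_x, best_score = x, s
        # Shrink the interval around the best value seen so far,
        # clipped so that it never grows beyond the current bounds.
        half = (high - low) * shrink / 2
        low, high = max(low, best_x - half), min(high, best_x + half)
    return best_x, best_score


# Toy objective with its maximum at x = 3.0 (stands in for a model KPI).
best_x, best_score = gradual_random_search(lambda x: -(x - 3.0) ** 2, 0.0, 10.0)
print(best_x, best_score)
```

Each round halves the search interval in this sketch, so later trials concentrate near the values that scored best earlier.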

  • Algorithm library
    • The SmartMiner provides more than 20 algorithms of different types to train models, including the classification, clustering, forecast, influence evaluation, and recommendation algorithms.

      • classification
      • Function
      • Description
      • Classification algorithms
      • NaiveBayes node
        The NaiveBayes classifier is a classification method in statistics. The NaiveBayes node forecasts class membership probabilities, for example, the probability that a sample belongs to a specified class. The NaiveBayes node can build models that forecast event probability by analyzing event attributes based on prior knowledge and obtained records.
        DecisionTree node
        The DecisionTree node can develop a classification system. Using this system, you can forecast results or classify records based on predefined decision policies.
        Logistics node
        The Logistics node determines the cause-effect relationships between variables, sets up regression models, and checks the correlations between variables as well as the direction and level of each correlation.
        RandomForest node
        The RandomForest node supports a large number of features and builds multiple decision tree models through random sampling to extract classification rules, which reduces the overfitting that a single decision tree can cause.
        OverlapNeighbour node
        The OverlapNeighbour node finds node pairs that have overlapping neighboring points.
        SparseLinear node
        The SparseLinear node supports a large number of features, precisely analyzes multi-dimensional data, and builds models.
      • Clustering algorithms
      • Kmeans node
        The Kmeans node groups the records in a data set into clusters, each represented by a cluster center. This method defines a fixed number of clusters, assigns records to clusters iteratively, and adjusts the cluster centers until the model can no longer be optimized. The Kmeans node uses unsupervised learning. It finds hidden patterns in the input data set instead of forecasting results.
        EM node
        The EM node groups the records in a data set into clusters. This method defines a fixed number of clusters, calculates the probability that each record belongs to a cluster, and updates the probability iteratively until the probability change is less than the preset Iteration End Threshold or the maximum number of iterations is reached.
      • Recommendation algorithms
      • Apriori node
        The Apriori node analyzes and mines data associations to obtain valuable information for the decision process.
        MinHash node
        The MinHash node analyzes the similarity between two data sets quickly.
        CF node
        The CF node analyzes the similarity between users or items, and provides personalized offers to users based on the similarities.
        SNSRS node
        The SNSRS node uses the SNS topology to build models and obtain the recommendations that are hidden behind the network.
        PersonalTag node
        The PersonalTag node analyzes the initial preferences, preview history, and features of previewed contents of users, and recommends offers to users accordingly.
        DiscriminationTree node
        The DiscriminationTree node provides recommendations to new users based on existing user group information as follows: The system asks a new user questions, uses the answers to find a matching user group for this user, and recommends the preferences of that user group to the user. (A recommendation can be, for example, the movie that has the highest score or is most frequently watched.)
        SimilarFeature node
        The SimilarFeature node calculates the similarity of contents based on the features and the feature weight.
        FullConnected node
        The FullConnected node is used to find fully connected subgraphs for home networks.
        LDA node
        Latent Dirichlet Allocation (LDA) is a way of automatically discovering topics in a large number of documents and generating a topic model. LDA can also find the categories that users prefer and make recommendations by category.
      • Influence evaluation algorithms
      • SPA node
        The SPA node spreads influence and identifies users based on the SNS network. You can use the SPA node to forecast results by classification, for example, customer loss probability or whether a customer will accept an offer. For example, to forecast customer loss probability, the system marks some lost customers on the SNS network and determines the influence that the lost customers have on other customers based on their call frequency and duration. The system then calculates the customer loss probability based on the obtained data and iteratively spreads the calculated probability through the influence spreading expression until the probability barely changes.
        PageRank node
        The PageRank algorithm measures node importance. For example, it measures the importance of website pages and ranks them by importance.
      • Forecast algorithms
      • TimeSeries node
        The TimeSeries node finds patterns in sequence data, that is, trends in how the data changes over time, and uses them to forecast future values.
        Linear node
        The Linear node determines the cause-effect relationships between variables, sets up regression models, and checks the correlations between variables as well as the direction and level of each correlation.
        GBDT node
        The GBDT algorithm is an iterative decision tree algorithm that consists of multiple decision trees. The regression trees from each iteration are merged based on their weights. The algorithm is used to solve regression and binary classification problems.
      • Dimensionality reduction algorithms
      • PCA node
        The PCA node transforms multiple indicators into a few comprehensive indicators that are not correlated with each other.
      • Natural language algorithms
      • TextClassify node
        The TextClassify node segments text and forecasts its classification.
        Segment node
        Chinese segmentation refers to the process of dividing written Chinese text into meaningful words based on specific rules, that is, converting the original unstructured text into structured information that computers can process. Algorithms on the Segment node are based on the Ansj framework, which is the Java version of ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System). The SmartMiner has implemented parallel computing for segmentation, which improves segmentation speed and accuracy.
      • Trajectory analysis algorithm
      • StayPointAnalysis node
        In a trajectory, some points denote locations where people have stayed for a while, such as shopping malls, tourist attractions, or gas stations. These points are called stay points.

      Data Modeling Function List
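As an illustration of one entry in the list above, the iterative procedure described for the Kmeans node (a fixed number of clusters, repeated assignment of records, and adjustment of cluster centers until nothing changes) can be sketched as follows. This is a plain single-machine sketch for 2-D points, not the distributed SmartMiner implementation. The function name, the random initialization from the data, and the stopping rule are assumptions.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster 2-D points into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # no center moved, so the model can no longer be improved
        centers = new_centers
    return centers, labels


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

The two well-separated groups in the sample data end up in different clusters. Which group receives label 0 depends on the random initialization.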