Data Integration

Function Features

  • Heterogeneous storage: Integrates data across heterogeneous data storage systems.

  • Autoscaling: Allocates resources on demand.

  • Multiple data sources: Supports tens of data sources.

  • Fully automated: Automatically identifies data features and creates mappings based on abundant data and machine learning.

Function View

Function Description
  • Configuration process
    • Configuring data flows/control flows: Provides web pages for configuring control flows and data flows.

      Figure: Control flow configuration page

    • Scheduling plan: Provides web pages for configuring flow scheduling times and intervals.

    • Online test: Provides a flow execution status display area, in which icons indicate the execution status of flows.

      Figure: Online test page

  • Task scheduling
    • Provides several scheduling trigger methods: time-based scheduling, manual scheduling, and interface-based scheduling. Also provides priority-based control so that flows with higher priorities are executed first (a minimal scheduling sketch follows this item).

      Figure: Setting scheduling time
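
      For illustration only, the following minimal Python sketch models time-triggered execution with priority-based ordering as described above; the flow names and priority values are hypothetical, and this is not the BDI scheduler itself.

        import heapq
        import time

        # Hypothetical queue entries: (priority, trigger_time, flow_name).
        # A lower priority number means a higher priority. The heap always
        # yields the highest-priority task first; if that task is not yet
        # due, the scheduler waits for its trigger time.
        queue = []
        heapq.heappush(queue, (2, time.time(), "daily_load_flow"))
        heapq.heappush(queue, (1, time.time(), "alarm_sync_flow"))

        while queue:
            priority, trigger_time, name = heapq.heappop(queue)
            if trigger_time > time.time():
                # Not due yet: push the task back and wait a little.
                heapq.heappush(queue, (priority, trigger_time, name))
                time.sleep(max(0.0, min(1.0, trigger_time - time.time())))
                continue
            print(f"executing {name} (priority {priority})")
        # Prints alarm_sync_flow (priority 1) before daily_load_flow (priority 2).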

  • Control flow
    • A control flow determines the sequence in which tasks are executed. Data is not transmitted between the tasks.

      The control flow functions are described below.

      • FTP/SFTP Upload/Download: Transmits data between a node server (FTP server) and the BDI or HDFS through FTP or SFTP.
      • Stored Procedure: A stored procedure task schedules and executes database stored procedures.
      • External Application: The BDI provides an interface for external programs such as shell programs. Through this interface, you can run operating system commands, third-party programs, or self-developed programs.
      • Convert: A convert task schedules and executes data flows.
      • Calculate: A calculate task calculates control flow variables.
      • Trigger: A trigger task establishes triggering relationships between control flows or tasks; it does not process data.
      • Convergence: A convergence task converges multiple task flows. It performs no data processing and is used only for convergence purposes.
      • Total Waiting Files: A file waiting task succeeds when the number of files that have arrived meets the configured requirement. If no file arrives, or the number of arrived files is still insufficient when the task times out, the task fails (see the sketch after this list).
      • Dependence: A dependence task adds dependency relationships between tasks in different control flows. It performs no actual data processing and is used only to establish the dependencies between tasks.
      • Index Audit: An index audit task uses a verification expression to verify the indicators produced during the data processing of a data flow task and generates pre-alarms.
      • User-defined Node: A user-defined node invokes a user-defined control flow task in a control flow.
      • Startloop: A Startloop node serves as the loop entrance and executes no data processing tasks. A loop flow can be configured only if it contains a Startloop node.
      • Empty Task: An empty task processes nothing. When a control flowchart contains tasks whose functions are still unclear, you can introduce empty tasks as temporary placeholders and replace them with the actual tasks once their functions are confirmed. An empty task can also serve as the initial task of a flow, because it generates no dirty data.
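
      As referenced in the Total Waiting Files entry above, this is a minimal Python sketch of that waiting logic; the file pattern, expected count, and timeout are hypothetical parameters, not BDI configuration names.

        import glob
        import time

        def wait_for_files(pattern, expected_count, timeout_s, poll_s=5):
            """Succeed once at least `expected_count` files matching
            `pattern` have arrived; fail if the timeout expires first."""
            deadline = time.time() + timeout_s
            while time.time() < deadline:
                if len(glob.glob(pattern)) >= expected_count:
                    return True   # enough files arrived: task succeeds
                time.sleep(poll_s)
            return False          # timed out: task fails

        # Example: wait up to 10 minutes for three daily extract files.
        # ok = wait_for_files("/data/in/extract_*.csv", 3, timeout_s=600)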

  • Data Flow
    • A data flow processes data sets, including extracting, cleaning, auditing, and loading data.

      The data extraction functions are described below.

      • Extract HDFS Text: An Extract HDFS Text node extracts data from HDFS. It can read files of different types and formats (for example, fixed-length files, delimiter files, and name-value files).
      • Extract HDFSXml: An Extract HDFSXml node extracts data from XML files in HDFS.
      • Extract XML: An Extract XML node extracts data from XML files on the BDI server.
      • Extract JDBC: An Extract JDBC node extracts data from database tables over a JDBC connection and works with any database that supports JDBC (see the sketch after this list).
      • Extract Oracle: An Extract Oracle node extracts data from Oracle databases through the Oracle Call Interface (OCI) and is faster than the Extract JDBC node.
      • Extract DB2: An Extract DB2 node extracts data from DB2 databases through the DB2 Call Level Interface (CLI) and is faster than the Extract JDBC node.
      • Extract HBase: An Extract HBase node extracts data from HBase, a distributed, column-based open-source database whose column-based design lets it store unstructured data.
      • Extract Hive: An Extract Hive node extracts data from the Hive data warehouse. Hive is a Hadoop-based data warehouse that maps structured data files to tables and lets users query data using Hive Query Language (HQL) statements.
      • Extract Memory: An Extract Memory node extracts data from a specified X detail record (XDR) table in the shared memory of a VecSurf Compute Unit (VCU).
      • Extract Text: An Extract Text node reads data from files of different types (for example, plain files and .gz files) and formats (for example, fixed-length files, delimiter files, and name-value files).
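
      As a rough illustration of what a JDBC-style extraction node does, the following Python sketch streams a table in batches over a generic database connection; sqlite3 merely stands in for a real JDBC source, and the table and column names are invented.

        import sqlite3

        def extract_rows(conn, table, batch_size=1000):
            """Stream rows from a table in batches, roughly the way an
            Extract JDBC node reads from a JDBC-capable database."""
            cur = conn.cursor()
            cur.execute(f"SELECT * FROM {table}")  # illustrative only
            while True:
                batch = cur.fetchmany(batch_size)
                if not batch:
                    break
                yield from batch

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE cdr (id INTEGER, msisdn TEXT)")
        conn.executemany("INSERT INTO cdr VALUES (?, ?)",
                         [(1, "8615500000001"), (2, "8615500000002")])
        for row in extract_rows(conn, "cdr"):
            print(row)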

      The data transformation functions are described below.

      • Filter: A filter node filters data sets using function expressions.
      • Group: A group node supports both system functions and user-defined functions, so users can perform different calculations on fields. Conversion functions are a family of functions that clean and process the data in data flows. They fall into two types: system functions, which are built into the BDI and cover the data processing operations common in data warehouse projects, and user-defined functions, which users develop for site-specific requirements.
      • Lookup: A lookup node provides several search modes: Exact Search, Descend Search, Zone Search, and Fuzzy Search.
      • Join: A join node maps two data sets based on fields. The system searches the two source data sets for the specified key fields and exports the result according to the join type.
      • Deduplicate: A deduplicate node removes duplicate records from input data sources based on the Lookup Fields, ensuring that no output records share the same Lookup Field values.
      • Route: A route node splits data. Users specify a splitting rule, based on which the Data Integration module splits the original data file into multiple files. The route node splits data by field, for example, by municipal branch, month, or data value.
      • Merge: A data set is a group of data read from a data source through an extraction node. In a merge, the data sets comprise a main data set and an incremental data set. For each record in the incremental data set, a merge node checks whether the same record exists in the main data set according to the Lookup Fields: if it does, the record in the main data set is replaced with the incremental record; otherwise, the incremental record is added to the main data set (see the sketch after this list).
      • Sort: A sort node sorts source data by key fields and lets users sort by any field.
      • Convert: A convert node lets users add a new field and define a function expression for it. The system calculates the input fields using the user-defined expression, and users can export the new field to target files or data sets. The convert node can also convert data types.
      • Union: A union node unites the mapped fields of several data sets into a new data set. A union node keeps identical records when uniting, whereas a merge node merges identical records based on the preset key fields.
      • Column to Row: A Column to Row node transforms columns into rows. The table to be transformed must contain at least two columns.
      • Append Merge: An Append Merge node merges incremental data into the master data based on insertion, deletion, and update keywords.
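
      As referenced in the Merge entry above, this minimal Python sketch shows the replace-or-append semantics, under the assumption that records are dictionaries and the Lookup Fields are plain field names; it does not reflect BDI internals.

        def merge(main, incremental, lookup_fields):
            """Merge an incremental data set into the main data set,
            matching records by their Lookup Field values."""
            key = lambda rec: tuple(rec[f] for f in lookup_fields)
            merged = {key(rec): rec for rec in main}
            for rec in incremental:
                merged[key(rec)] = rec  # replace if present, append otherwise
            return list(merged.values())

        main = [{"user": "a", "balance": 10}, {"user": "b", "balance": 5}]
        inc = [{"user": "b", "balance": 7}, {"user": "c", "balance": 3}]
        print(merge(main, inc, ["user"]))
        # "a" is kept, "b" is replaced by the incremental record, "c" is added.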

      The data loading functions are described below.

      • Load Text: A Load Text node loads processed data into one or more target files and can also export the line numbers of the exported records (see the sketch after this list).
      • Load HDFS Text: A Load HDFS Text node loads processed data into one or more target files in HDFS and exports the line number of each record.
      • Load HDFSXml: A Load HDFSXml node loads data to XML files in HDFS (Hadoop Distributed File System).
      • Load XML: A Load XML node loads data sets into XML files according to the configured file formats.
      • Load JDBC: A Load JDBC node uses Java Database Connectivity (JDBC) to load processed data into a database and can be used with any database that supports JDBC.
      • Load Oracle: A Load Oracle node uses Oracle SQL*Loader to load calculated or filtered data into an Oracle database.
      • Load DB2: A Load DB2 node invokes DB2 loader commands to load processed data into a DB2 database. Being specialized for DB2, it is highly efficient.
      • Load HBase: A Load HBase node loads converted data into the HBase database.
      • Load Greenplum: A Load Greenplum node loads transformed data into the Greenplum database.
      • Load Hive: A Load Hive node loads data into the Hive data warehouse. Hive is a Hadoop-based data warehouse that maps structured data files to tables and lets users query data using Hive Query Language (HQL) statements.
      • Load Slowly Changing Dimension Data: If the entity type of the governance model to be loaded is Oracle Type 2 Dimension Table, the system loads the model into the Oracle database in JDBC Slowly Changing Dimension (Type 2) mode by default.
      • Load DG: A Load DG node connects to the DG to load data based on data models.
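
      As referenced in the Load Text entry above, this is a minimal Python sketch of loading records into a delimited target file with optional line numbers; the delimiter and file name are assumptions for the example, not BDI defaults.

        import csv

        def load_text(records, target_path, with_line_numbers=False):
            """Write processed records to a target file, optionally
            prefixing each record with its output line number."""
            with open(target_path, "w", newline="") as f:
                writer = csv.writer(f, delimiter="|")
                for lineno, rec in enumerate(records, start=1):
                    row = [lineno, *rec] if with_line_numbers else list(rec)
                    writer.writerow(row)

        load_text([("a", 10), ("b", 7)], "out.txt", with_line_numbers=True)
        # out.txt then contains:
        # 1|a|10
        # 2|b|7
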
  • Flow Monitoring
    • Real-time flow monitoring: Queries the status of control flows that are being executed or are suspended.

    • History flow monitoring: Queries the status of control flows that were executed successfully, failed, or were terminated.

    • Warning messages: Supports warnings. Operators can add, modify, delete, and query warning messages by task type or task name in the BDI.

    • Process tracing: Process tracing starts from a chosen piece of metadata and graphically displays all downstream metadata, showing the data direction and the processing procedure. It can be used to determine the direction of a data flow and to locate data transformation faults (see the sketch after this list).

    • Progress monitoring: Supports querying the distribution of task scheduling times.

      Figure: Progress monitoring
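
      As referenced in the Process tracing entry above, this minimal Python sketch walks a metadata lineage graph downstream from a starting point; the graph and the table names are invented for the example.

        from collections import deque

        # Hypothetical lineage: each piece of metadata maps to the
        # metadata derived from it downstream.
        lineage = {
            "ods.cdr_raw": ["dwd.cdr_clean"],
            "dwd.cdr_clean": ["dws.cdr_daily", "dws.cdr_user"],
            "dws.cdr_daily": [],
            "dws.cdr_user": [],
        }

        def trace_downstream(start):
            """Breadth-first walk over all metadata downstream of `start`,
            the direction that process tracing displays."""
            seen, edges = {start}, []
            queue = deque([start])
            while queue:
                node = queue.popleft()
                for nxt in lineage.get(node, []):
                    if nxt not in seen:
                        seen.add(nxt)
                        edges.append((node, nxt))
                        queue.append(nxt)
            return edges

        for src, dst in trace_downstream("ods.cdr_raw"):
            print(f"{src} -> {dst}")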