Flow Collection - Huawei FusionInsight

Streaming Data Ingestion and Processing
Function Feature

Flume Data Ingestion

The Streaming component uses the distributed data collection software Flume to collect original signaling messages (typically a dozen fields) of the A interface and Gn interface from the network signaling mirroring systems, preprocess the messages (for example, remove the country code 86 from phone numbers and filter out redundant fields), and save the messages to topic X in Kafka.
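
The preprocessing step can be pictured as a custom Flume interceptor. The following sketch, written against the open-source org.apache.flume.interceptor.Interceptor API, shows the kind of field standardization and projection described above; the record delimiter, field positions, and class name are assumptions made for this example, not the CAE's actual interceptor.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

/**
 * Illustrative Flume interceptor: strips the leading country code "86" from the
 * calling-number field and keeps only the fields a service needs. The delimiter,
 * field positions, and class name are assumptions made for this sketch.
 */
public class SignalingPreprocessInterceptor implements Interceptor {

    private static final String DELIMITER = "\\|";          // assumed field separator
    private static final int[] KEEP_FIELDS = {0, 1, 2, 3};  // assumed service fields
    private static final int PHONE_FIELD = 1;               // assumed calling-number index

    @Override
    public void initialize() {
        // No resources to set up in this sketch.
    }

    @Override
    public Event intercept(Event event) {
        String[] fields = new String(event.getBody(), StandardCharsets.UTF_8).split(DELIMITER, -1);
        if (fields.length < KEEP_FIELDS.length) {
            return null;  // drop malformed records
        }
        // Field standardization: remove the leading country code 86.
        if (fields[PHONE_FIELD].startsWith("86")) {
            fields[PHONE_FIELD] = fields[PHONE_FIELD].substring(2);
        }
        // Field projection: keep only the fields required by the service.
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < KEEP_FIELDS.length; i++) {
            if (i > 0) {
                body.append('|');
            }
            body.append(fields[KEEP_FIELDS[i]]);
        }
        event.setBody(body.toString().getBytes(StandardCharsets.UTF_8));
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<>(events.size());
        for (Event e : events) {
            Event processed = intercept(e);
            if (processed != null) {
                out.add(processed);
            }
        }
        return out;
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    /** Builder referenced from the Flume agent configuration. */
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new SignalingPreprocessInterceptor();
        }

        @Override
        public void configure(Context context) {
            // No configurable parameters in this sketch.
        }
    }
}
```

An interceptor of this kind would be referenced from the Flume agent configuration by its Builder class and attached to the source that feeds the Kafka sink.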

Kafka Data Distribution

The ingested data is saved in the Hadoop Kafka system and can be transferred to other components in sequence.

Spark Data Computing

Spark Streaming in Hadoop is a stream computing system that quickly accumulates and classifies data, generates the data required by services, and saves the results to Kafka topics.

Compatibility with Open-Source Frameworks

Provides a unified IDE development environment and is compatible with multiple open-source stream computing frameworks.

Function View
Function Description
  • Basic Terms
    • Event

      The basic unit of information imported from external data sources. In a telecom network, user behavior can be recorded as events; for example, a call is an event (see the sketch after this list of terms).

      Streaming data ingestion

      Ingestion of real-time events, performed by Flume in Hadoop. Flume ingestion rules can be customized, for example, the source and destination of events and whether preprocessing is required.

      Streaming data distribution

      A process of saving the ingested streaming data to a common location and then distributing it to service systems or middleware. Kafka in the CAE distributes real-time events; the service systems or middleware can subscribe to the data in Kafka.

      Stream computing

      Data computing performed by Spark in the CAE. A Spark rule is a computing rule preconfigured in the CAE or obtained through secondary development. A Spark task is a Spark Streaming compute instance running in Yarn. A task generally has a task ID and can be submitted, stopped, and queried. The task status can be submitting, submitted, running, stopping, or stopped.
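
      As a concrete illustration of an event, the sketch below models a single call record with a few typical fields. The class and field names are assumptions made for this example, not the CAE's actual event schema.

```java
import java.time.Instant;

/** Hypothetical representation of a single call event ingested from the A interface. */
public final class CallEvent {
    private final String callingNumber;   // already standardized, country code removed
    private final String calledNumber;
    private final Instant startTime;
    private final long durationSeconds;

    public CallEvent(String callingNumber, String calledNumber, Instant startTime, long durationSeconds) {
        this.callingNumber = callingNumber;
        this.calledNumber = calledNumber;
        this.startTime = startTime;
        this.durationSeconds = durationSeconds;
    }

    public String getCallingNumber() { return callingNumber; }
    public String getCalledNumber() { return calledNumber; }
    public Instant getStartTime() { return startTime; }
    public long getDurationSeconds() { return durationSeconds; }
}
```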

  • Logical Architecture

      Management node (Streaming Server): provides interfaces for external systems to invoke stream computing configuration services and provides the IDE GUI for developing and managing stream ingestion processes.

      Data ingestion (Flume): preconfigured with multiple data interceptors; the data to be collected can be customized for each service.

      Real-time distribution (Kafka): supports real-time distribution of events across computing nodes in a cluster and can be scaled out horizontally.

      Computing and processing: Spark Streaming is used for basic event stream computing, and the Huawei-developed PME is used for complex event pattern matching.

      Data Ingestion

      The Streaming component uses the distributed data collection software Flume to collect original signaling messages (typically a dozen fields) of the A interface and Gn interface from the network signaling mirroring systems, preprocess the messages (for example, remove the country code 86 from phone numbers and filter out redundant fields), and save the messages to topic X in Kafka.

      Data Distribution

      The ingested data is saved in the Hadoop Kafka system and can be transferred to other components in sequence.

      Kafka

      Broker: Server in the Kafka cluster.

      Topic: Category of messages published to the Kafka cluster. Physically, messages of different topics are stored separately. Logically, the messages of a topic can be stored on one or more brokers, and data can be produced or consumed by specifying the topic rather than the data location.

      Partition: Each topic contains one or more partitions, and users can specify the partition count during topic creation. Each partition corresponds to a directory where the partition's data and index files are stored.

      Producer: Generates messages and publishes them to the Kafka brokers.

      Consumer: Consumes messages from the Kafka topics.
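
      The following sketch ties these terms together using the open-source Kafka Java client: it creates a topic with an explicit partition count, publishes one message as a producer, and reads it back as a consumer. The broker address, topic name, and consumer group are assumptions for this example, and cluster-specific settings such as security configuration are omitted.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaTermsDemo {

    private static final String BROKERS = "broker1:9092";  // assumed broker address
    private static final String TOPIC = "topicX";          // assumed topic name

    public static void main(String[] args) throws Exception {
        // Topic: created with an explicit partition count and replication factor
        // (this call fails if the topic already exists).
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", BROKERS);
        try (AdminClient admin = AdminClient.create(adminProps)) {
            admin.createTopics(Collections.singleton(new NewTopic(TOPIC, 3, (short) 1))).all().get();
        }

        // Producer: publishes messages to the brokers under the topic.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", BROKERS);
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>(TOPIC, "13900000000", "13900000000|13800000000|120|A"));
        }

        // Consumer: subscribes to the topic and reads messages regardless of which broker stores them.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", BROKERS);
        consumerProps.put("group.id", "service-system");   // assumed consumer group
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singleton(TOPIC));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d value=%s%n", record.partition(), record.value());
            }
        }
    }
}
```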

      Data Computing

      Spark Streaming in Hadoop is a stream computing system that quickly accumulates and classifies data, generates the data required by services, and saves the results to Kafka topics.

      Spark Streaming

      Spark Streaming divides the continuous data stream into small batch data sets and submits them as Spark computing tasks, continuously producing the computing result of each batch.
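
      A minimal sketch of this micro-batching model, using the open-source Spark Streaming Kafka integration (spark-streaming-kafka-0-10): each 10-second slice of the Kafka stream becomes one batch, and a simple per-batch count is computed. The broker address, topic name, batch interval, consumer group, and field layout are assumptions, and the job stands in for a CAE-preconfigured rule rather than reproducing one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import scala.Tuple2;

public class StreamingCountJob {
    public static void main(String[] args) throws InterruptedException {
        // Submitted to Yarn via spark-submit, which supplies the master setting.
        SparkConf conf = new SparkConf().setAppName("StreamingCountJob");
        // Each 10-second window of the continuous stream becomes one micro-batch.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");  // assumed broker address
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "spark-streaming-demo");   // assumed consumer group

        // Source: subscribe to the ingestion topic written by Flume.
        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList("topicX"), kafkaParams));

        // Per-batch accumulation: count events by calling number (assumed to be the first field).
        stream.mapToPair(record -> new Tuple2<>(record.value().split("\\|", -1)[0], 1L))
              .reduceByKey(Long::sum)
              .print();  // a real rule would write the result back to a Kafka topic

        jssc.start();
        jssc.awaitTermination();
    }
}
```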

  • Function Feature
    • Event Management Center

      Manage event categories, query events, create events, and use the preconfigured event editing tool to edit and import events in batches.

      Stream Ingestion Orchestration

      1. Configure the source: Currently, the Spooling Directory (customized to read decompressed files), Avro, FTP/SFTP, and SDTP are supported.

      2. Configure the channel: Currently, the Memory Channel is supported.

      3. Configure the data interceptors.

      Multiple interceptors are customized in the CAE, for example:

      - Field projection: ingests the fields required by services

      - Field filtering: filters data by field values

      - Field transformation: processes field values using functions and outputs the results

      - Field encryption: encrypts field values and outputs the results

      - Field standardization: for example, removes the country code 86 from phone numbers and unifies the formats of data fields

      4. Configure the sinks: Kafka and HDFS are currently supported.

      5. (Optional) Configure the selector: specify the sinks for the channels.

      6. (Optional) Configure sink groups: group multiple sinks to process the channel data.

      Figure: Configuring the CAE on the IDE page

      Stream Computing Orchestration

      This function provides the stream computing editing page, where the name, description, and content of a stream computing rule can be configured visually. The following information can be configured in a computing rule:

      1. Source: Kafka, Socket

      2. Interceptor: Projection, Filtering, Association and Backfill, Grouping

      3. Sink: Kafka, HDFS
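
      To make these rule elements concrete, the sketch below shows one possible way the projection, filtering, and grouping interceptors and an HDFS sink could map onto open-source Spark Streaming operators, taking the Kafka source stream from the earlier sketch as input. It illustrates the concepts only, not the CAE's internal implementation; the field positions and the HDFS output path are assumptions.

```java
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;

import scala.Tuple2;

/**
 * One possible mapping of the visual rule elements onto open-source Spark Streaming
 * operators. Field positions and the HDFS output path are assumptions for this sketch.
 */
public class ComputingRuleSketch {

    /** The source is a DStream of delimited records, e.g. the Kafka stream from the earlier sketch. */
    public static void applyRule(JavaDStream<String> source) {
        JavaPairDStream<String, Long> grouped = source
                // Filtering: keep only well-formed records from the A interface (assumed field 3).
                .filter(line -> {
                    String[] fields = line.split("\\|", -1);
                    return fields.length > 3 && "A".equals(fields[3]);
                })
                // Projection: keep only the calling number (assumed field 0).
                .mapToPair(line -> new Tuple2<>(line.split("\\|", -1)[0], 1L))
                // Grouping: count the records per calling number within each batch.
                .reduceByKey(Long::sum);

        // Sink (HDFS): write each batch to its own timestamped output directory.
        grouped.foreachRDD((rdd, time) -> rdd
                .map(t -> t._1() + "," + t._2())
                .saveAsTextFile("hdfs:///user/cae/output/call-counts-" + time.milliseconds()));
    }
}
```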

      Stream Computing Management

      1. Customize Spark computing tasks.

      Spark task rule development process:

      - Develop Spark task rules and implement the SDK Spark Application parameter check interface in the entry class.

      - Use the encapsulated task generation interface in the SDK if task results are sent to Kafka or an Oracle database.

      - Use the REST API to start, stop, or query the task.

      Spark task deployment methods:

      Compress the JAR package and XML files containing configuration information into a ZIP package, and import it on the CAE GUI.

      2. Use Spark computing tasks preconfigured in the CAE.