Dynamic scaling for data processing streaming system

    公开(公告)号:US11095522B2

    公开(公告)日:2021-08-17

    申请号:US16547399

    申请日:2019-08-21

    IPC分类号: H04L29/06 H04L12/24

    摘要: Described herein is a system and method for dynamically scaling a stream processing system (e.g., “exactly once” data stream processing system). Various parameter(s) (e.g., user-configurable capacity, real-time load metrics, and/or performance counters) can be used to dynamically scale in and/or scale out the “exactly once” stream processing system without system restart. Delay introduced by this scaling operation can be minimized by utilizing a combination of mutable process topology (which can dynamically assign certain parts of the system to a new host machine) and controllable streaming processor movement with checkpoints and the streaming protocol controlled recovery which still enforces the “exactly once” delivery metric.

    Streaming joins in constrained memory environments

    公开(公告)号:US09880769B2

    公开(公告)日:2018-01-30

    申请号:US14732374

    申请日:2015-06-05

    IPC分类号: G06F3/06 G06F17/30

    摘要: Large amounts of memory can be consumed in streaming joins because events from one stream are held in memory while waiting for matching events from a second stream. Memory needs can be reduced by analyzing the join condition to determine the bounds on the time discrepancy between events in the two streams. When it is determined that an event from one stream must occur prior to the matching event from the other stream, the later-arriving stream data can be ingested with an intentional delay. When it is determined that regardless of input received from a first stream, no output will be produced when there is no input from the second stream, pulling data from the first stream can cease. A multi-stage join plan can be employed so that a less busy stream can be scanned with increasing amounts of intentional delay. Only unmatched data is stored.

    Journaling of streaming anchor resource(s)

    公开(公告)号:US11226966B2

    公开(公告)日:2022-01-18

    申请号:US16590909

    申请日:2019-10-02

    摘要: Described herein is a system and method of journaling of a streaming anchor resource. An input node can store a value of a property associated with the streaming data in a persistent indexed data structure. The input node can generate an anchor that describes a particular point in time in a data stream. The anchor can include an index into the persistent indexed data structure of the stored value of the property associated with the streaming data. The generated anchor and streaming data can be provided to the downstream node. During recovery of a downstream node, the input node can utilize a received anchor to retrieve a value of a property associated with the streaming data from the persistent indexed data structure, and, provide a batch of data based upon the received anchor and the retrieved property value.

    STREAMING JOINS IN CONSTRAINED MEMORY ENVIRONMENTS
    5.
    发明申请
    STREAMING JOINS IN CONSTRAINED MEMORY ENVIRONMENTS 有权
    在受限制的记忆环境中进行流水作业

    公开(公告)号:US20160357476A1

    公开(公告)日:2016-12-08

    申请号:US14732374

    申请日:2015-06-05

    IPC分类号: G06F3/06

    摘要: Large amounts of memory can be consumed in streaming joins because events from one stream are held in memory while waiting for matching events from a second stream. Memory needs can be reduced by analyzing the join condition to determine the bounds on the time discrepancy between events in the two streams. When it is determined that an event from one stream must occur prior to the matching event from the other stream, the later-arriving stream data can be ingested with an intentional delay. When it is determined that regardless of input received from a first stream, no output will be produced when there is no input from the second stream, pulling data from the first stream can cease. A multi-stage join plan can be employed so that a less busy stream can be scanned with increasing amounts of intentional delay. Only unmatched data is stored.

    摘要翻译: 因为来自一个流的事件被保留在存储器中,同时等待来自第二个流的匹配事件,所以在流连接中可以消耗大量的存储器。 可以通过分析连接条件来确定两个流中的事件之间的时间差异的界限来减少内存需求。 当确定来自一个流的事件必须在来自另一个流的匹配事件之前发生时,可以以有意的延迟来摄取后来到达的流数据。 当确定不管从第一流接收的输入如何,当没有来自第二流的输入时,将不会产生输出,从第一流中拉取数据可以停止。 可以采用多级连接计划,以便能够以更多的有意延迟扫描较不繁忙的流。 只存储不匹配的数据。

    Using anchors for reliable stream processing

    公开(公告)号:US10148719B2

    公开(公告)日:2018-12-04

    申请号:US14732416

    申请日:2015-06-05

    IPC分类号: G06F15/16 H04L29/06

    摘要: Stream processing can be performed using a pull-based, anchor-based methodology that guarantees once and only once processing and repeatability of the creation of output with no additional communication overhead during normal processing. Each node (computing device) in the graph (representing interconnected computing devices) establishes a system of anchors. An anchor describes a point in the output stream of the node, so that every event in the stream is either before or after any given anchor.

    USING ANCHORS FOR RELIABLE STREAM PROCESSING
    7.
    发明申请
    USING ANCHORS FOR RELIABLE STREAM PROCESSING 审中-公开
    使用锚杆进行可靠的流程处理

    公开(公告)号:US20160359940A1

    公开(公告)日:2016-12-08

    申请号:US14732416

    申请日:2015-06-05

    IPC分类号: H04L29/06

    CPC分类号: H04L65/604 H04L65/4069

    摘要: Stream processing can be performed using a pull-based, anchor-based methodology that guarantees once and only once processing and repeatability of the creation of output with no additional communication overhead during normal processing. Each node (computing device) in the graph (representing interconnected computing devices) establishes a system of anchors. An anchor describes a point in the output stream of the node, so that every event in the stream is either before or after any given anchor.

    摘要翻译: 可以使用基于引用的锚定方法来执行流处理,该方法在正常处理期间保证一次处理和重复创建输出,而没有额外的通信开销。 图中的每个节点(计算设备)(表示互连的计算设备)建立了锚的系统。 锚点描述了节点的输出流中的一个点,以便流中的每个事件在任何给定的锚点之前或之后。

    HANDLING OUT OF ORDER EVENTS
    8.
    发明申请
    HANDLING OUT OF ORDER EVENTS 有权
    处理订单事件

    公开(公告)号:US20160359910A1

    公开(公告)日:2016-12-08

    申请号:US14732398

    申请日:2015-06-05

    IPC分类号: H04L29/06

    摘要: Processing streaming data in accordance with policies that group data by source, enforce a maximum permissible late arrival value for streaming data, a maximum permissible early arrival for data and/or a maximum degree to which data can be out of order and still be compliant with the out of order policy is described. The correct starting point for reading a data stream so as to produce correct output from a given output start time can be enabled using the early arrival policy. Using combinations of policies, output can be generated promptly (with low latency). When input from a given source is not disrupted, output can be generated with low latency. Output can be generated even when the input stops by applying a late arrival policy.

    摘要翻译: 根据按源分组数据的策略处理流数据,执行流数据的最大允许延迟到达值,数据的最大允许早期到达和/或数据可能失序的最大程度,并且仍然符合 描述了乱序策略。 用于读取数据流以便从给定输出开始时间产生正确输出的正确起始点可以使用提前到达策略来实现。 使用策略的组合,可以及时生成输出(低延迟)。 当给定源的输入不被中断时,可以以低延迟生成输出。 即使通过应用延迟到达策略停止输入,也可以生成输出。

    Enhanced anchor protocol for event stream processing

    公开(公告)号:US11044291B2

    公开(公告)日:2021-06-22

    申请号:US16145456

    申请日:2018-09-28

    摘要: Described herein is a system and method for startup and/or recovery for stream processing. During a startup phase: start anchor request(s), each identifying a particular time, are accumulated until request(s) are pending from downstream nodes. A minimum time of the accumulated start anchor request(s) is determined. If the processing system is an input node, an anchor associated with the determined minimum time is generated. Otherwise, a start anchor request is provided to an upstream node identifying the determined minimum time. Once the anchor associated with the determined minimum time is received (or generated), the anchor is provided in response to a polled start anchor request for the determined minimum time from a downstream node. Asynchronous requests for batches of data bounded by two specific anchors are performed in accordance with information stored in an ordered collection of anchors during a recovery phase.

    Efficient out of process reshuffle of streaming data

    公开(公告)号:US11010171B2

    公开(公告)日:2021-05-18

    申请号:US16426683

    申请日:2019-05-30

    摘要: Methods, systems, apparatuses, and computer program products are provided for processing a stream of data. A maximum temporal divergence is established for data flushed to a data store from a plurality of upstream partitions. Each of a plurality of data flushers, each corresponding to an upstream partition, may obtain an item of data from a data producer. Each data flusher may determine whether flushing the data to the data store would exceed the maximum temporal divergence. Based at least on determining that flushing the data to the data store would not exceed the maximum temporal divergence, the data may be flushed to the data store for ingestion by a downstream partition and a data structure (e.g., a ledger) may be updated to indicate a time associated with the most recent item of data flushed to the data store.