-
91.
Publication No.: US11886399B2
Publication date: 2024-01-30
Application No.: US17006504
Filing date: 2020-08-28
Inventors: John Joyce, Marshall A. Isman, Sandrick Melbouci
CPC classification: G06F16/215, G06F16/2228, G06F16/285, G06N5/04, G06N20/00
Abstract: Methods and systems are configured to determine a semantic meaning for data and generate data processing rules based on the semantic meaning of the data. The semantic meaning includes syntactical or contextual meaning for the data that is determined, for example, by profiling, by the data processing system, values stored in a field included in data records of one or more datasets; applying, by the data processing system, one or more classifiers to the profiled values; identifying, based on applying the one or more classifiers, one or more attributes indicative of a logical or syntactical characteristic for the values of the field, with each of the one or more attributes having a respective confidence level that is based on an output of each of the one or more classifiers. The attributes are associated with the fields and are used for generating data processing rules and processing the data.
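Illustrative note (not part of the patent record): the profile, classify, and score sequence in this abstract can be sketched roughly as follows. The function names, the regex-based classifiers, and the 0.7 threshold are all assumptions made for illustration; the patent does not specify them.

```python
import re
from collections import Counter


def profile(values):
    """Summarize a field: counts per distinct non-empty value, plus their total."""
    non_empty = [v for v in values if v]
    return {"counts": Counter(non_empty), "total": len(non_empty)}


def pattern_classifier(name, regex):
    """Build a hypothetical classifier that scores a profile against one pattern."""
    pattern = re.compile(regex)

    def classify(prof):
        if prof["total"] == 0:
            return name, 0.0
        matched = sum(n for v, n in prof["counts"].items() if pattern.fullmatch(v))
        # Confidence here is simply the fraction of profiled values that match.
        return name, matched / prof["total"]

    return classify


# Assumed example classifiers; a real system would use many more signals.
CLASSIFIERS = [
    pattern_classifier("us_zip_code", r"\d{5}(-\d{4})?"),
    pattern_classifier("iso_date", r"\d{4}-\d{2}-\d{2}"),
]


def attributes_for_field(values, threshold=0.7):
    """Return {attribute: confidence} for classifiers that meet the threshold."""
    prof = profile(values)
    scored = (classifier(prof) for classifier in CLASSIFIERS)
    return {name: conf for name, conf in scored if conf >= threshold}


zip_attributes = attributes_for_field(["02139", "10001", "94105-1234", "", "abcde"])
```

Three of the four non-empty values match the ZIP pattern, so the field is tagged `us_zip_code` with confidence 0.75 and receives no date attribute.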
-
Publication No.: US20230409835A1
Publication date: 2023-12-21
Application No.: US18201545
Filing date: 2023-05-24
IPC classification: G06F40/30, G06F16/93, G06N20/00, G06F16/908
CPC classification: G06F40/30, G06F16/908, G06N20/00, G06F16/93
Abstract: A data processing system for discovering a semantic meaning of a field included in one or more data sets is configured to identify a field included in one or more data sets, with the field having an identifier. For that field, the system profiles data values of the field to generate a data profile, accesses a plurality of label proposal tests, and generates a set of label proposals by applying the plurality of label proposal tests to the data profile. The system determines a similarity among the label proposals and selects a classification. The system identifies one of the label proposals as identifying the semantic meaning. The system stores the identifier of the field with the identified one of the label proposals that identifies the semantic meaning.
-
Publication No.: US11835994B2
Publication date: 2023-12-05
Application No.: US16517320
Filing date: 2019-07-19
Inventors: Andrew Blom, Darren Miller, Marshall A. Isman
IPC classification: G06F7/00, G06F17/00, G06F16/25, G06F16/901, G06F8/34, H04L67/565
CPC classification: G06F16/254, G06F8/34, G06F16/258, G06F16/9024, H04L67/565
Abstract: A method for generating an executable application to transform and load data into a structured dataset includes receiving a metadata file that specifies values for parameters for structuring data feeds, received from a networked data source, into a structured database. The metadata file specifies logical rules for transforming the data feeds. The values of the parameters and the logical rules for transforming the plurality of the data feeds are validated to ensure logical consistency for each data feed. Data rules are generated that specify standards for transforming each data feed in accordance with the validated values of the parameters and logical rules. The executable application is generated that is configured to receive source data comprising a data feed from one or more data sources and transform the source data into structured data that satisfies the one or more standards for the structured data record in compliance with the data rules.
-
Publication No.: US20230359668A1
Publication date: 2023-11-09
Application No.: US18114212
Filing date: 2023-02-24
IPC classification: G06F16/901
CPC classification: G06F16/9024
Abstract: Described herein are techniques, performed by a data processing system, for enabling efficient development of software application programs in a dynamic environment with multiple datasets by generating entries in a dataset catalog to provide a software application program with access to output data dynamically generated by dataflow graphs, the entries associated with respective software application programs developed as dataflow graphs. The techniques include identifying a subgraph, wherein, when the subgraph is executed, the subgraph generates output data by applying one or more data processing operations to data obtained from one or more data sources; creating, in the dataset catalog, a new entry associated with the identified subgraph, the new entry associated with information indicating nodes, links, and configuration parameters of the identified subgraph; and configuring the dataset catalog to enable access to the new entry, in the dataset catalog, associated with the identified subgraph.
-
Publication No.: US11782820B2
Publication date: 2023-10-10
Application No.: US17029828
Filing date: 2020-09-23
Inventors: Joyce L. Vigneau, Mark Staknis, Xin Li
CPC classification: G06F11/3664, G06F11/323, G06F11/362, G06F11/3636, G06F11/3656, G06F11/3696
Abstract: A computer-implemented method for debugging an executable control flow graph that specifies control flow among a plurality of functional modules, with the control flow being represented as transitions among the plurality of functional modules, the computer-implemented method including: specifying a position in the executable control flow graph at which execution of the executable control flow graph is to be interrupted; wherein the specified position represents a transition to a given functional module, a transition to a state in which contents of the given functional module are executed or a transition from the given functional module; starting execution of the executable control flow graph in an execution environment; and at a point of execution representing the specified position, interrupting execution of the executable control flow graph; and providing data representing one or more attributes of the execution environment in which the given functional module is being executed.
-
Publication No.: US11748165B2
Publication date: 2023-09-05
Application No.: US16906193
Filing date: 2020-06-19
IPC classification: G06F9/50
CPC classification: G06F9/5038
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for workload automation and job scheduling information. One of the methods includes obtaining job dependency information, the job dependency information specifying an order of execution of a plurality of jobs. The method also includes obtaining data lineage information that identifies dependency relationships between data stores and transformation, wherein at least one transformation accepts data from a first data store and produces data for a second data store. The method also includes creating links between the job dependency information and the data lineage information. The method also includes determining an impact of a change in a planned execution of an application of the plurality of applications based on the job dependency information, the created links, and the data lineage information.
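Illustrative note (not part of the patent record): one way to picture the linking of job dependencies to data lineage is the sketch below. The job names, store names, and the three dictionaries are hypothetical, and the breadth-first traversal is an assumed stand-in for whatever impact analysis the patent actually claims.

```python
from collections import deque

# Hypothetical job dependency information: job -> jobs that run after it.
JOB_DEPENDENCIES = {"extract": ["clean"], "clean": ["report"], "report": []}

# Hypothetical data lineage: transformation -> (input store, output store).
LINEAGE = {
    "extract_tx": ("source_feed", "raw_store"),
    "clean_tx": ("raw_store", "clean_store"),
    "report_tx": ("clean_store", "report_store"),
}

# Assumed links between the two models: job -> transformation it runs.
JOB_TO_TRANSFORMATION = {
    "extract": "extract_tx",
    "clean": "clean_tx",
    "report": "report_tx",
}


def impact_of_change(job):
    """Return (affected jobs, affected output stores) if `job` is changed or delayed."""
    affected = {job}
    queue = deque([job])
    while queue:
        current = queue.popleft()
        for downstream in JOB_DEPENDENCIES.get(current, []):
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    # Follow the job-to-transformation links into the lineage model.
    stores = {
        LINEAGE[JOB_TO_TRANSFORMATION[j]][1]
        for j in affected
        if j in JOB_TO_TRANSFORMATION
    }
    return affected, stores


jobs, stores = impact_of_change("clean")
```

Delaying `clean` in this toy setup affects `clean` and `report`, and through the links it also flags `clean_store` and `report_store` as stale.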
-
Publication No.: US11741091B2
Publication date: 2023-08-29
Application No.: US15829152
Filing date: 2017-12-01
Inventors: David Clemens, Dusan Radivojevic, Neil Galarneau
IPC classification: G06F16/245, G06F16/22, G06F16/248, G06F16/83, G06F40/117
CPC classification: G06F16/245, G06F16/22, G06F16/248, G06F16/83, G06F40/117
Abstract: Among other things, we describe a method of receiving a portion of metadata from a data source, the portion of metadata describing nodes and edges; generating instances of a data structure representing the portion of metadata, at least one instance of the data structure including an identification value that identifies a corresponding node, one or more property values representing respective properties of the corresponding node, and one or more pointers to respective identification values, each pointer representing an edge associated with a node identified by the corresponding respective identification value; storing the instances of the data structure in random access memory; receiving a query that includes an identification of at least one particular element of data; and using at least one instance of the data structure to cause a display of a computer system to display a representation of lineage of the particular element of data.
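Illustrative note (not part of the patent record): the data structure described here (an identification value, property values, and pointers to other identification values) might look roughly like the following in-memory sketch. `NodeRecord`, `build_index`, and `lineage` are hypothetical names, and treating each pointer as an upstream edge is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    """One in-memory instance: an id, its properties, and pointers to other ids."""
    node_id: str
    properties: dict = field(default_factory=dict)
    upstream_ids: list = field(default_factory=list)  # each entry represents an edge


def build_index(nodes, edges):
    """Turn metadata (nodes and (source, target) edges) into an id -> record index."""
    index = {node_id: NodeRecord(node_id, dict(props)) for node_id, props in nodes.items()}
    for source, target in edges:
        index[target].upstream_ids.append(source)
    return index


def lineage(index, node_id):
    """Answer a lineage query by following pointers upstream from `node_id`."""
    ordered, seen = [], set()
    stack = list(index[node_id].upstream_ids)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        ordered.append(current)
        stack.extend(index[current].upstream_ids)
    return ordered


index = build_index(
    nodes={"src.csv": {"type": "file"}, "staging": {"type": "table"}, "report": {"type": "view"}},
    edges=[("src.csv", "staging"), ("staging", "report")],
)
report_lineage = lineage(index, "report")
```

Because every record lives in a plain dictionary keyed by id, the query resolves each pointer with a single lookup rather than a round trip to the metadata source.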
-
Publication No.: US11720583B2
Publication date: 2023-08-08
Application No.: US17878106
Filing date: 2022-08-01
Inventors: Ian Schechter, Tim Wakeling, Ann M. Wollrath
IPC classification: G06F16/24, G06F16/2458, G06F16/13, G06F16/25, G06F16/28, G06F16/17, G06F16/901, G06F9/50
CPC classification: G06F16/2471, G06F9/5066, G06F16/13, G06F16/1734, G06F16/254, G06F16/285, G06F16/9024, G06F16/284
Abstract: In a first aspect, a method includes, at a node of a Hadoop cluster, the node storing a first portion of data in HDFS data storage, executing a first instance of a data processing engine capable of receiving data from a data source external to the Hadoop cluster, receiving a computer-executable program by the data processing engine, executing at least part of the program by the first instance of the data processing engine, receiving, by the data processing engine, a second portion of data from the external data source, storing the second portion of data other than in HDFS storage, and performing, by the data processing engine, a data processing operation identified by the program using at least the first portion of data and the second portion of data.
-
99.
Publication No.: US11669343B2
Publication date: 2023-06-06
Application No.: US17477922
Filing date: 2021-09-17
Inventors: Oded Ravid, Trevor Murphy
IPC classification: G06F7/00, G06F9/448, G06F16/901, G06F16/2455, G06F16/178, G06F9/445, G06F8/41
CPC classification: G06F9/4494, G06F9/44505, G06F16/1794, G06F16/24568, G06F16/9024, G06F8/433
Abstract: A method is described for processing keyed data items that are each associated with a value of a key, the keyed data items being from a plurality of distinct data streams, the processing including collecting the keyed data items, determining, based on contents of at least one of the keyed data items, satisfaction of one or more specified conditions for execution of one or more actions and causing execution of at least one of the one or more actions responsive to the determining.
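Illustrative note (not part of the patent record): the collect, test, and act loop in this abstract could be sketched as below. `KeyedCollector`, the two stream names, and the "order fully paid" condition are hypothetical examples chosen for illustration only.

```python
from collections import defaultdict


class KeyedCollector:
    """Collect items per key across streams and fire an action when a condition holds."""

    def __init__(self, condition, action):
        self._pending = defaultdict(list)  # key -> [(stream, payload), ...]
        self._condition = condition
        self._action = action

    def receive(self, stream, key, payload):
        self._pending[key].append((stream, payload))
        # The condition inspects the contents of the items collected for this key.
        if self._condition(self._pending[key]):
            self._action(key, self._pending.pop(key))


def order_fully_paid(items):
    """Assumed condition: payments seen so far cover the order total."""
    totals = [p["total"] for s, p in items if s == "orders"]
    paid = sum(p["paid"] for s, p in items if s == "payments")
    return bool(totals) and paid >= totals[0]


fired = []
collector = KeyedCollector(order_fully_paid, lambda key, items: fired.append((key, items)))
collector.receive("orders", "A17", {"total": 40})
collector.receive("orders", "B02", {"total": 15})
collector.receive("payments", "A17", {"paid": 40})
```

Only key `A17` triggers the action here; `B02` stays pending until a matching payment arrives.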
-
Publication No.: US20230100418A1
Publication date: 2023-03-30
Application No.: US17665109
Filing date: 2022-02-04
Inventors: Dusan Radivojevic, Robert Parks, Adam Weiss, Maja Jankovic, John Vickery
IPC classification: G06F3/06
Abstract: An electronic system for increasing the speed of preparing data with a specified data quality for storage by automatically identifying for a user, with minimal user input, common contexts among (i) fields in disparate datasets, and (ii) names the user has specified as potentially describing the fields, and by using those common contexts to govern the disparate datasets prior to storage to ensure the specified data quality.
-