Abstract:
A clustering-based approach to data standardization is provided. Certain embodiments take as input a plurality of addresses, identify one or more features of the addresses, cluster the addresses based on the one or more features, utilize the cluster(s) to provide a data-based context useful in identifying one or more synonyms for elements contained in the address(es), and standardize the address(es) to an acceptable format, with one or more synonyms and/or other elements being added to or taken away from the input address(es) as part of the standardization process.
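The flow described in this abstract (extract features, cluster, infer synonyms from the cluster context, standardize) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the abstract: the feature (house number plus the initial letter of each later token), the rule that tokens differing at the same position within a cluster are synonyms, and the choice of the longest variant as the canonical form.

```python
from collections import defaultdict


def tokenize(address):
    # Lowercase and drop simple punctuation so "St." and "st" compare equal.
    return address.lower().replace(".", "").replace(",", "").split()


def features(address):
    # Hypothetical feature: the leading token (house number) plus the
    # initial letter of every later token.
    tokens = tokenize(address)
    if not tokens:
        return ("", ())
    return (tokens[0], tuple(t[0] for t in tokens[1:]))


def cluster_addresses(addresses):
    # Addresses with identical features fall into the same cluster.
    clusters = defaultdict(list)
    for address in addresses:
        clusters[features(address)].append(address)
    return list(clusters.values())


def synonym_map(members):
    # All members of a cluster have the same token count, so tokens align
    # by position. Differing tokens at one position are candidate synonyms.
    mapping = {}
    for column in zip(*(tokenize(a) for a in members)):
        variants = set(column)
        if len(variants) > 1:
            # Assumption: the longest variant is the canonical form.
            canonical = max(variants, key=lambda v: (len(v), v))
            for variant in variants:
                mapping[variant] = canonical
    return mapping


def standardize(addresses):
    # Map each input address to its standardized form, using only the
    # synonyms inferred from that address's own cluster.
    result = {}
    for members in cluster_addresses(addresses):
        mapping = synonym_map(members)
        for address in members:
            tokens = [mapping.get(t, t) for t in tokenize(address)]
            result[address] = " ".join(t.title() for t in tokens)
    return result


if __name__ == "__main__":
    sample = ["12 Main St", "12 Main Street", "12 main st.", "7 Oak Ave", "7 Oak Avenue"]
    for raw, clean in standardize(sample).items():
        print(f"{raw!r} -> {clean!r}")
```

A production system would likely use richer features and a real clustering algorithm. The sketch only shows how a cluster can supply the context needed to treat "St" and "Street" as synonyms.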
Abstract:
The present invention relates to data cleansing, and in particular to performing semantic standardization within a database before the transform portion of the extract-transform-load (ETL) process. Provided are a method, a system, and a computer program product for standardizing data within a database engine. A standardization function is configured to determine at least one standardized value for at least one data value by applying a standardization table in the context of that data value. A database query is received that identifies the standardization function, at least one database value, and the context of the data, and the standardization function is then invoked.
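As a rough illustration of invoking a standardization function from inside a database query, the sketch below registers a user-defined function with SQLite through Python's `sqlite3` module. SQLite merely stands in for the database engine. The `STANDARDIZATION_TABLE` contents, the `standardize` name, the `customers` schema, and the context labels are all hypothetical and not drawn from the abstract.

```python
import sqlite3

# Hypothetical standardization table: (context, raw value) -> standardized value.
# The same raw value can map differently depending on its context.
STANDARDIZATION_TABLE = {
    ("state", "calif."): "CA",
    ("state", "california"): "CA",
    ("street_type", "st"): "Street",
    ("title", "st"): "Saint",
}


def standardize(value, context):
    # Look the value up in the table under the given context;
    # unknown values pass through unchanged.
    if value is None:
        return None
    key = (context, value.strip().lower())
    return STANDARDIZATION_TABLE.get(key, value)


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call it by name.
conn.create_function("standardize", 2, standardize)

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO customers (state) VALUES (?)",
    [("Calif.",), ("california",), ("NY",)],
)

# The query names the function, the database value, and the context.
rows = conn.execute(
    "SELECT standardize(state, 'state') FROM customers ORDER BY id"
).fetchall()
states = [row[0] for row in rows]
print(states)
```

Because the function runs inside the query, the values arrive already standardized at the transform step, which is the ordering the abstract describes.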