Entity Resolution

From CDQ
Name: Entity Resolution
Short description: The CDQ Matching Engine identifies duplicate business partners both within a single dataset and across multiple datasets. Using entity resolution, it compares attributes to determine if records represent the same real-world entity, ensuring data consistency and accuracy.
Description: The CDQ Matching Engine is employed in all services where the goal is to find overlaps between one or more business partners in different datasets. The matching engine leverages a straightforward approach to identify duplicate records or entities within and across datasets. The method is rooted in the concept of entity resolution or identity resolution, which aims to determine whether different records represent the same real-world entity by comparing their attributes. The matching engine is leveraged in particular in the following products:
  • CDQ Lookup: Finding matching business partners in connected data sources
  • CDQ Partner Guard: Record linkage enabling business partner updates, for example in the CDQ Data Sharing Pool
  • CDQ Duplicate Guard: Identifying duplicates in a single dataset or across datasets
  • CDQ AML Guard: Searching for matching entries in sanctions and watchlists

CDQ Matching Engine - Leveraging Bayesian Inference for Identity Resolution


Breakdown of the approach

  • Selection of Relevant Properties: Selecting the attributes that are best suited for establishing the identity of a business partner, i.e. for identifying records that represent the identical entity. Out of all available attributes, a subset is chosen by which business partners can be distinguished from each other, i.e. a set of attributes that establish the identity of an organization.

While some attributes are well qualified for this (e.g. equal names provide strong evidence for the existence of a duplicate, similarly certain business identifiers), others are only suitable in combination with others (e.g. the same city alone provides only very little evidence for the existence of a duplicate, as many organizations may be located in the same locality). Similar name attributes provide strong evidence for the existence of a duplicate. However, there are many aspects that require careful consideration when comparing names. For example, it has to be decided if legal forms are to be considered, how to deal with acronyms (e.g. “BMW”), and which components the name should consist of. The latter aspect is quite important, as comparing legal entity names with organizational information (e.g. division names) or with brand or trade names hardly provides reliable results. Address information is another strong indicator for the existence of duplicate business partner data. However, address attributes are only suitable for this purpose in combination with other attributes. In addition, addresses are characterized by a wide variety of possible representations.

  • Fuzzy search for candidates: Using a fuzzy index search (backed by separate, complex configurations), potential candidates that may represent the same entity as the input entity are identified.
  • Pairwise Comparison and ad-hoc Cleaning: For each pair of records, the chosen properties are compared on a one-to-one basis. This involves evaluating whether each attribute matches or differs between the two records.

Cleaners harmonize and standardize the input data, e.g. if one data source uses abbreviations, whereas another uses full names only. The selection of attributes and the required harmonization depends on the use case for the identification of duplicates. For example, the consolidation of records that represent parts of one and the same legal entity requires a different treatment than the aggregation of data objects that belong to the same site of a business partner. In the first case, D-U-N-S numbers are a good indicator, as well as company register IDs, tax numbers or names. In the second case, tax numbers, or any kind of legal entity identifier, are not really qualified, as a focus needs to be laid on address attributes (because a business partner may have many locations, but there is e.g. only one legal entity with one tax number). Tax numbers and the like therefore can only provide evidence to a certain extent.

  • Bayesian Inference for Probability Estimation: Based on the comparison, a probability is assigned to each property indicating the likelihood that the two records refer to the same entity. For example, a match in names might be given a high probability of indicating the same entity, while a discrepancy in zip codes might lower that probability.
  • Overall Probability Calculation: The probabilities for each property are combined using Bayesian inference to calculate an overall probability that the two records represent the same entity.
  • Threshold: A threshold probability is set, above which two records are considered duplicates of the same entity. This helps in making the final decision on whether records are likely duplicates.
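The steps above can be sketched in a few lines. The per-attribute probabilities, the combination via naive Bayesian chaining in odds space, and the threshold value below are illustrative assumptions, not CDQ's actual configuration.

```python
# Sketch of combining per-attribute match probabilities with Bayesian
# inference (naive chaining in odds space) and a decision threshold.
# All numbers are illustrative, not the engine's real configuration.

def combine_probabilities(probabilities):
    """Chain per-attribute probabilities into one overall probability."""
    odds = 1.0
    for p in probabilities:
        odds *= p / (1.0 - p)      # convert each probability to odds
    return odds / (1.0 + odds)     # convert combined odds back

# Hypothetical judgements: name and city matched well, zip code differed.
per_attribute = [0.9, 0.85, 0.3]

overall = combine_probabilities(per_attribute)
THRESHOLD = 0.5                    # records above this count as duplicates
is_duplicate = overall > THRESHOLD
```

Note that a probability of exactly 0.5 is neutral: it leaves the combined odds unchanged, which is why missing attributes can simply be treated as 0.5.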


[Figure: Duplicate Identification Overview]


Why is efficient matching of business partner data challenging?

Matching challenges
Typical challenges when matching business partner names
  • The representation of the name is not standardized. While for some names abbreviations are used (e.g. “Inst. of Information Management”), others include legal forms (“ZF Friedrichshafen AG”) or are written in uppercase or lowercase letters only (“BAYER AG”).
  • The name is represented in different characters (e.g. Chinese, Cyrillic, and/or Latin characters).
  • The name includes misplaced information. For example, c/o information is added to the name instead of the respective data attribute.
  • In some cases, acronyms are used, while in other cases the full name is used (e.g. “BMW AG” vs. “Bayerische Motorenwerke AG”).
  • Misspellings may occur in various forms: characters added (e.g. “Bayern AG” instead of “Bayer AG”), characters omitted (e.g. “Byer AG”), characters replaced (e.g. “Baier AG”), or characters transposed (e.g. “Bayre AG”).
  • The order of name components is inconsistent (e.g. “Lindner Hotel Hamburg” vs. “Hotel Lindner in Hamburg”).
  • The name is represented differently (e.g. only one attribute “business partner name” in one data model vs. “name 1-5” in SAP’s data model).
Typical challenges when matching business partner addresses
  • Misspellings of cities or thoroughfares (see above).
  • No consistent use of abbreviations (e.g. “Lindenstr.” vs. “Lindenstrasse”).
  • Misplaced information (e.g. c/o information or building information is included in the thoroughfare, or the house number is sometimes included in the street name and sometimes placed separately).
  • Missing attributes (e.g. one system includes building information, another does not, or data was just not maintained).
  • Original names of cities vs. international names (e.g. “München” vs. “Munich”, “Mailand” vs. “Milano”).
  • Used characters (e.g. Chinese etc., see above).
  • Semantic ambiguities for certain fields (e.g. different post codes available in Ireland: Eircode vs. GeoDirectory vs. Loc8 Code).
  • Post box addresses vs. street addresses (sometimes they are maintained separately, sometimes the street and post box address are maintained in one address data object).





CDQ Lookup: Finding matches in connected data sources

Search for candidates

Data that is locally stored in the CDQ infrastructure, i.e. data sources that provide data dumps, is indexed. Based on the input data, a fuzzy search is performed on the index to identify potential candidates. Additionally, data sources that are only connected remotely (e.g. the European VAT Information Exchange System) are queried with the input data, and the results are considered as additional candidates.

Candidates are records that could potentially represent a duplicate, i.e. the candidate may represent exactly the same entity as the entity described by the input data. They are a highly fuzzy selection from the total number of records in a database.

The number of candidates can be customized. The higher the number of candidates, the less likely it is that a fitting record is missed in subsequent steps. However, the more candidates there are, the lower the performance of the service, because each candidate has to be evaluated in subsequent steps with costly operations (cleaning, comparing). Setting the number of candidates to 1 results in a speedy response; however, if this single candidate found by the fuzzy search does not represent the expected entity, no result is returned, even though additional candidates might have fit. On the other hand, setting the number of candidates to 500 makes it likely that the list comprises the fitting candidate (if one is available in the data sources), but the service execution is significantly less performant, because 500 candidates have to be analyzed.


Comparing candidates

For comparing a candidate against the searched entity represented by the input data, individual attributes are compared pair-wise with each other. Before comparing, a set of cleaners is applied to both strings; the comparison then takes place leveraging one of the available string comparison algorithms. The comparison results in a probability that is to be interpreted as: "based on just this single attribute, we judge the probability that both records represent the identical entity". Obviously, if just one single attribute is compared, this probability is not very meaningful on its own. Leveraging Bayesian inference, all individual probabilities are "chained" to receive a meaningful overall result.

The attributes that are compared depend on the particular data source and, of course, on the available input. E.g. if no street is given in the input, then the street does not provide any information on whether two records represent the same entity. Thus, a missing attribute is always considered as a 50% probability that the records represent the identical entity and a 50% probability that they do not.

For each string comparison a high and low confidence value is defined:

  • High = 0.9 (means, if both strings are evaluated to be identical, we are 90% sure that the overall record is a duplicate, i.e. represents the identical real-world entity)
  • Low = 0.3 (means, if both strings are evaluated not to be identical, we are only 30% confident that the records are duplicates)

The high and low confidences work as a "flooring" and "ceiling": in case the similarity would be 0.2, the low value considered in the calculation would still be 0.3, whereas an exact similarity of 1.0 would be "ceiled" by 0.9.
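Read that way, the flooring/ceiling can be sketched as a clamping of the raw similarity into the configured band. This is a sketch of one plausible reading of the mechanism, not the engine's actual code; the values match the example above.

```python
# Sketch of the high/low "flooring"/"ceiling": a raw 0..1 similarity is
# clamped into the configured [LOW, HIGH] probability band before it
# enters the Bayesian chaining. Values are illustrative.

HIGH = 0.9   # confidence if the strings are judged identical
LOW = 0.3    # confidence if the strings are judged different

def attribute_probability(similarity, low=LOW, high=HIGH):
    """Clamp a similarity into [low, high]; missing values are neutral."""
    if similarity is None:         # missing attribute: 50/50 evidence
        return 0.5
    return min(max(similarity, low), high)

assert attribute_probability(0.2) == 0.3   # floored by LOW
assert attribute_probability(1.0) == 0.9   # ceiled by HIGH
```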

Which strings are to be compared, which comparators are used, which cleaners are applied beforehand, and which high and low probabilities are considered is all defined in matching configurations. For each data source there is a dedicated matching configuration reflecting the particularities of each country and source. If no special configuration is available, a default configuration is applied.

In general, the Matching Configuration defines the following:

  • Search
    • Define how the search for candidates is performed by specifying which attribute values are used for the search. Additionally it is defined how many candidates the search could return at maximum (see previous section). Note: The higher the number of candidates, the higher the execution time of the matching process as more candidates have to be assessed.
  • Assessment
    • which attributes are used for assessing the similarity between the given and candidate record
    • which cleaners should be applied before comparing the attribute values
    • which comparator should be applied for comparing the attribute values
    • which probabilities per attribute comparison are employed for calculating the confidence whether the given and candidate record represent the identical entity
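A matching configuration along these lines might look as follows. All field names and values here are hypothetical and only illustrate the structure described above; the actual CDQ configuration format may differ.

```json
{
  "search": {
    "attributes": ["name", "city", "postCode"],
    "maxCandidates": 50
  },
  "assessment": [
    {
      "attribute": "name",
      "cleaners": ["LegalFormCleaner", "LowerCaseNormalizeCleaner"],
      "comparator": "JaroWinklerComparator",
      "highProbability": 0.9,
      "lowProbability": 0.3
    },
    {
      "attribute": "postCode",
      "cleaners": ["DigitsOnlyCleaner"],
      "comparator": "ExactComparator",
      "highProbability": 0.8,
      "lowProbability": 0.4
    }
  ]
}
```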


Cleaners

Cleaners transform or normalize data before it is effectively compared. So a cleaner's job is to make comparison easier by removing from data values all variations that are not likely to indicate genuine differences. For example, a cleaner might strip everything except digits from a zip code. Or, it might normalize and lowercase addresses. Or translate dates into a common format.
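Two of the cleaners listed below can be sketched in a few lines (illustrative only, not the engine's implementation):

```python
# Minimal sketches of two cleaners described in this section: the
# digits-only cleaner and the lower-case normalize cleaner.
import re
import unicodedata

def digits_only(value):
    """Keep only digits, e.g. to compare post codes."""
    return re.sub(r"\D", "", value)

def lower_case_normalize(value):
    """Lowercase, trim, collapse whitespace, and strip accents."""
    value = " ".join(value.lower().split())
    decomposed = unicodedata.normalize("NFD", value)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

assert digits_only("CH-9000 St. Gallen") == "9000"
assert lower_case_normalize("  Café  MÜLLER ") == "cafe muller"
```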

  • Attribute-in-Attribute Cleaner (Matching): Removes the value of one attribute (e.g. the locality) from a selected attribute, e.g. "Germany" in "Company (Germany) AG". Used especially for removing the local or international name from a business partner name.
  • Country Cleaner (Matching)
  • Digits-only Cleaner (Matching): Removes everything that is not a digit, e.g. to compare post codes.
  • Legal Form Cleaner (Matching): Special cleaner for business partner names. The cleaner identifies a legal form in the input string and keeps only the part before the legal form. To recognize legal forms reliably, the cleaner needs country information. For example, in "CDQ AG Factory St. Gallen", "AG" is identified as the legal form and only "CDQ" is used for matching.
    • cleaner: The LegalFormCleaner must be applied BEFORE lower-case normalization (i.e. the LowerCaseNormalizeCleaner) because legal forms are matched case-sensitively.
    • configProperty: A record-specific parameter for the cleaner, in this case the value of the record's COUNTRY_SHORTNAME property.
  • Lower-case Normalize Cleaner (Matching): The most widely used cleaner. It lowercases all letters, removes whitespace characters at the beginning and end, and normalizes whitespace characters between tokens. It also removes accents, e.g. turning "é" into "e".
  • Non-character Cleaner (Matching): Removes any characters that are not Latin characters, including numbers.
  • Phone Number Cleaner (Matching): Cleaner for international phone numbers. It assumes that the same phone number may come in forms like 0047 55301400, +47 55301400, 47-55-301400, or +47 (0) 55301400.
  • Punctuation Cleaner (Matching)
  • Replace Cleaner (Matching): Replaces strings by other strings. Patterns may also comprise regular expressions, and character case can be ignored. To use this cleaner, it has to be defined as a separate object. Patterns and replacements have to be provided as a JSON string.
    • p: Pattern to replace. The string may comprise a regular expression to define a group of strings to replace, e.g. 0{1,5} for a sequence of one to five "0" characters.
    • r: Replacement; this string is used to replace the pattern. May be empty to just remove the pattern.
    • i: true to ignore the case of the given pattern, else false.
    • isRegex: true means the pattern is treated as a regex pattern; if false, the pattern is treated as a string literal. If not defined, the default value is false.
  • Strip Non-text Characters Cleaner (Matching): Strips control characters 0x00 to 0x1F and 0x7F to 0x9F and special symbols in the range 0xA1 to 0xBF.
  • Trim Cleaner (Matching): Trims whitespace characters at the beginning and end of the input string.
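As an illustration, a Replace Cleaner definition with the p/r/i/isRegex parameters described above might look like this. Only the parameter keys come from the documentation; the surrounding object structure and values are hypothetical.

```json
{
  "type": "ReplaceCleaner",
  "replacements": "[{\"p\": \"0{1,5}\", \"r\": \"\", \"i\": true, \"isRegex\": true}]"
}
```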


Comparators

A Comparator can compare two string values and produce a similarity measure between 0.0 (meaning completely different, or better said: the probability that the two records represent the identical entity is 0) and 1.0 (meaning exactly equal, or better said: it is 100% sure that the two records represent the identical entity if this single attribute is identical). Comparators are used because we need something better than simply knowing whether two values are the same or not. Also, different kinds of values must be compared differently, and comparison of complex strings like names and addresses is a whole discipline in itself.

  • Exact Comparator (Matching): Simply reports 0.0 if the values are not equal and 1.0 if they are.
  • Geoposition Comparator (Matching): Compares two geographic positions given by coordinates by the distance between them along the earth's surface. It assumes the parameters are of the form 59.917516,10.757933, where the numbers are degrees latitude and longitude, respectively. The computation simply assumes a sphere with a radius of 6371 kilometers, so no particular geodetic model is assumed. WGS84 coordinates will work fine, while UTM coordinates will not work. See the Duke documentation for details, e.g. how to use the max-distance parameter.
  • Jaro-Winkler Comparator (Matching): Uses the Jaro-Winkler distance, which has been found to be the best available general string comparator for deduplication. Use it for short strings like given names and family names; it is not as good for longer, general strings.

The Jaro-Winkler comparator is an adaptation of the Jaro distance. The Jaro distance measures the number of characters that are common between two strings. Only common characters that are no further apart than half the length of the longer string are considered. Then it is determined how many of the common characters are at the same position in both strings. The similarity is then calculated using the required transpositions for the set of common characters (Herzog et al., 2010, p. 540; Jaro, 1989).

Winkler's adaptation introduces a correction factor which increases the similarity if both strings share a common prefix. Like the Jaro distance, Winkler developed the measure for comparing names in US census data. Thus, the Jaro-Winkler distance is particularly suitable for name fields (Cohen, Fienberg, Ravikumar, & Fienberg, 2003, p. 74; Winkler, 1988).

  • Levenshtein Comparator (Matching): The most widely used fuzzy comparator. Uses the Levenshtein distance to compute the similarity. Basically, it measures the number of edit operations needed to get from string 1 to string 2.
  • Longest Common Substring Comparator (Matching): Does not merely find the longest common substring, but does so repeatedly down to a minimal substring length. See [1] for details.
  • Metaphone Comparator (Matching): Compares field values using Metaphone.
  • Soundex Comparator (Matching): Compares field values using Soundex.
  • Weighted Levenshtein Comparator (Matching): Based on the Levenshtein comparator, with the difference that substrings can be weighted and given preference over the rest of the string. A typical scenario is a street value holding street name and house number together. In such a case, the house number is slightly more important than the name, so you can assign a weight (preference) to digits, which leads to better matching results.
  • q-Gram Comparator (Matching): Uses n-grams of field values to calculate their similarity. It is similar to Levenshtein, but a bit more eager to consider strings the same, and does not care what order tokens are in. For strings consisting of tokens that may be reordered (e.g. "Hotel Lindner Hamburg" and "Lindner Hotel Hamburg") it may be a better alternative than Levenshtein. It may be further configured with the q parameter to specify the size of the n-grams; the default size is 3, which is fine for most business partner use cases.
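As an illustration of the Levenshtein comparator described above, a minimal similarity computation might look like this. It is a sketch, not the engine's implementation; the normalization by the longer string's length is one common choice.

```python
# Levenshtein edit distance via dynamic programming, plus a similarity
# normalized by the length of the longer string.

def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Map the edit distance into a 0.0..1.0 similarity."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

assert levenshtein("Bayer AG", "Baier AG") == 1   # one replaced character
assert similarity("Bayer AG", "Bayer AG") == 1.0
```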


Decision: Matching score threshold

Each match when searching for company data comes with an overall matching score. The higher the score, the higher the confidence that the found record represents the actually searched entity. By default, all matches with a score greater than 0.5 are returned. A score of 0.5 is to be interpreted as a 50% confidence that the record is the right one and a 50% confidence that the record represents a different entity. The higher you set the threshold, the fewer matches you will receive, but the remaining ones come with a higher confidence.





Duplicate Guard - Consolidation of identified duplicates

The Duplicate Guard allows defining how records in a duplicate matching group (entities representing the same real-world entity) are consolidated into a "best guess" or "golden" record.

Consolidating a group of records is done in two steps:

1. "Base Record Selection"
2. "Consolidated Record Creation"

Each step can be configured by defining rules that determine what should be done. The default behaviour if no configuration is provided:

• "Base Record Selection" is skipped
• "Consolidated Record Creation" is done by choosing for each attribute the value that occurs most often

Base Record Selection

From a list of records, a base record can be determined by using a binary decision tree. Each node has a condition in the form of a rule that divides the group into the records that fulfil the condition and the ones that do not.

[Figure: Base Record Selection]

The tree and its nodes are executed from top to bottom and from left to right. This means the rules have an implicit "priority" from top to bottom and left to right.

[Figure: Rules tree]



• Only one branch will be evaluated further
• Possible decisions:
  • Rule 1
  • Rule 1 → Rule 2
  • Rule 1 → Rule 3
  • Rule 1 → Rule 3 → Rule 4
  • Rule 1 → Rule 3 → Rule 5
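One plausible reading of this traversal as code is sketched below. The tree structure, the rule functions, and the way the matching subset is carried down the "yes" branch are illustrative assumptions, not the documented implementation.

```python
# Sketch of base record selection with a binary decision tree: at each
# node a rule splits the group; only one branch is evaluated further.

def select_base_record(records, node):
    """Walk the tree top-down; follow "yes" if any record fulfils the rule."""
    while node is not None:
        matching = [r for r in records if node["rule"](r)]
        if matching:
            records = matching          # continue with the fulfilling records
            node = node.get("yes")
        else:
            node = node.get("no")       # nobody fulfils it: other branch
    return records[0] if records else None

# Hypothetical two-node tree: not marked for deletion, then account group.
tree = {
    "rule": lambda r: not r.get("DEL_FLAG"),              # IS_NOT_MARKED
    "yes": {"rule": lambda r: r.get("KTOKD") == "ZDFO"},  # HAS_VALUE
}
records = [{"DEL_FLAG": True, "id": 1}, {"KTOKD": "ZDFO", "id": 2}]
assert select_base_record(records, tree)["id"] == 2
```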

A rule is defined by:

• type (mandatory): Defines the type of the rule to be applied
• jsonPath (mandatory): Path to the record attribute that should be checked by the rule
• valueType (optional): Used if the value defined by the JSON path needs to be cast
• expectedValues (optional): Used in case the rule type indicates that the record attribute should be checked against some constant

Types of rules to be used as a condition in a node are:

• HAS_VALUE
• HAS_NOT_VALUE
• HAS_VALUE_IN
• HAS_NOT_VALUE_IN
• HAS_VALUE_LIKE
• HAS_NOT_VALUE_LIKE
• HAS_HIGHEST_VALUE
• HAS_LOWEST_VALUE
• IS_MARKED
• IS_NOT_MARKED
• IS_MOST_COMPLETE
• HAS_HIGHEST_MATCH_SCORE


Examples

In case a record must have no value in a certain attribute:

    "rule": {
       "type": "IS_NOT_MARKED",
       "jsonPath": "DEL_FLAG"
    }
    


In case a record must have a certain value in an attribute:

    "rule": {
        "type": "HAS_VALUE",
        "jsonPath": "KTOKD",
        "expectedValues": [
            "ZDFO"
        ]
    }
    


In case a record should have the smallest value among the group in a certain attribute:

    "rule": {
        "type": "HAS_LOWEST_VALUE",
        "jsonPath": "External Id",
        "valueType": "NUMBER"
    }
    


Consolidated Record Creation

The actual consolidation of all the records is done in four steps:

1. Exclude/filter unwanted records
2. Protect attributes from the base record that should not be overwritten
3. Add/replace attributes in the golden record with attributes from the other records, based on the defined rules
4. Apply default handling for attributes for which no rule was defined in the previous step (MAJORITY)


Excluding Records

The use case for this step is to exclude records where the quality of the data is questionable, for example because a record is already marked for deletion or has not been updated for a long time. Available rule types are:

• HAS_VALUE
• HAS_NOT_VALUE
• HAS_VALUE_IN
• HAS_NOT_VALUE_IN
• HAS_VALUE_LIKE
• HAS_NOT_VALUE_LIKE
• IS_MARKED
• IS_NOT_MARKED


Example

    "recordExclusionRules": [
        {
          "type": "IS_MARKED",
          "jsonPath": "DEL_FLAG"
        }
    ]
    


Protecting Attributes

This is an optional step that only makes sense in case a base record has been selected. A typical use case is that a company wants to keep one record of the group in the system and therefore wants to keep (meta)data only relevant for this particular record (e.g. the ID, date of creation, ...). Technically, it is just a list of attribute names that will be protected from any further alteration.


Example

    "protectedAttributes": [
        "Storage ID",
        "Data Source ID",
        "External Id",
        "ERDAT"
    ]
    


Apply Attribute Enhancement Rules

This is the most important and most powerful step in the golden record creation flow. It allows simple manipulations like:

• removing a certain attribute
• setting a value from a record that was flagged
• or setting the majority value among all records

Sometimes more complex logic is needed; for this reason, a chain of constraints can be defined. The structure of an AttributeEnhancementRule is the following:

• attributes (mandatory): List of attributes that are affected by the rule
• constraintChain: A chain of rules (same as selection rules or exclusion rules) to filter for fitting records
• creationType (optional): Allowed values are:
  • MAJORITY (determine the most frequent value)
  • JOIN_DISTINCT (join all distinct values to a list)
  • CONSTANT (set the value defined in "creationConstant")
• creationConstant (only in combination with creationType=CONSTANT): Defines a constant to be set as the value for an attribute
• executionType (mandatory): Describes what is done in case the constraints are fulfilled. Allowed values are:
  • ENRICH (set if value is empty)
  • SET (add or replace the current value)
  • REMOVE (just deletes the current value, if any)
• fallbackExecutionType (optional): In case the constraints are not met, this is the option to be executed (instead of executionType). Allowed values are:
  • ENRICH (set if value is empty)
  • SET (add or replace the current value)
  • REMOVE (just deletes the current value, if any)


Examples