Entity Resolution
Name | Entity Resolution |
---|---|
Short description | The CDQ Matching Engine identifies duplicate business partners both within a single dataset and across multiple datasets. Using entity resolution, it compares attributes to determine if records represent the same real-world entity, ensuring data consistency and accuracy. |
Description | The CDQ Matching Engine is employed in all services where the goal is to find overlaps between one or more business partners in different datasets. The matching engine leverages a straightforward approach to identify duplicate records or entities within and across datasets. The method is rooted in the concept of entity resolution or identity resolution, which aims to determine whether different records represent the same real-world entity by comparing their attributes. The matching engine is leveraged in particular in the following products: |
CDQ Matching Engine - Leveraging Bayesian Inference for Identity Resolution
The CDQ Matching Engine is employed in all services where the goal is to find overlaps between one or more business partners in different datasets. The matching engine leverages a straightforward approach to identify duplicate records or entities within and across datasets. The method is rooted in the concept of entity resolution or identity resolution, which aims to determine whether different records represent the same real-world entity by comparing their attributes.
Breakdown of the approach
- Selection of Relevant Properties: Selecting the attributes that are qualified for establishing the identity of a business partner, i.e. the attributes best suited for identifying records that represent the identical entity. Out of all available attributes, a subset is chosen by which business partners can be distinguished from each other (i.e. attributes that establish the identity of an organization).
While some attributes are well qualified for this (e.g. equal names provide strong evidence for the existence of a duplicate, as do certain business identifiers), others are only suitable in combination with others (e.g. the same city alone provides only very little evidence for the existence of a duplicate, as many organizations may be located in the same locality). Similar name attributes provide strong evidence for the existence of a duplicate. However, there are many aspects that require careful consideration when comparing names. For example, it has to be decided whether legal forms are to be considered, how to deal with acronyms (e.g. “BMW”), and which components the name should consist of. The latter aspect is quite important, as comparing legal entity names with organizational information (e.g. division names) or with brand or trade names hardly provides reliable results. Address information is another strong indicator for the existence of duplicate business partner data. However, address attributes are only suitable for this purpose in combination with other attributes. In addition, addresses are characterized by a wide variety of possible representations.
- Fuzzy search for candidates: Employing a fuzzy index search (backed by separate, more complex configurations), potential candidates that may represent the same entity as the input entity are identified.
- Pairwise Comparison and ad-hoc Cleaning: For each pair of records, the chosen properties are compared on a one-to-one basis. This involves evaluating whether each attribute matches or differs between the two records.
Cleaners harmonize and standardize the input data, e.g. if one data source uses abbreviations whereas another uses full names only. The selection of attributes and the required harmonization depend on the use case for the identification of duplicates. For example, the consolidation of records that represent parts of one and the same legal entity requires a different treatment than the aggregation of data objects that belong to the same site of a business partner. In the first case, D-U-N-S numbers are a good indicator, as are company register IDs, tax numbers, or names. In the second case, tax numbers, or any kind of legal entity identifier, are not well qualified, as the focus needs to be on address attributes (because a business partner may have many locations, but there is e.g. only one legal entity with one tax number). Tax numbers and the like can therefore only provide evidence to a certain extent.
- Bayesian Inference for Probability Estimation: Based on the comparison, a probability is assigned to each property indicating the likelihood that the two records refer to the same entity. For example, a match in names might be given a high probability of indicating the same entity, while a discrepancy in zip codes might lower that probability.
- Overall Probability Calculation: The probabilities for each property are combined using Bayesian inference to calculate an overall probability that the two records represent the same entity (a small sketch of this combination follows after this list).
- Threshold: A threshold probability is set, above which two records are considered duplicates of the same entity. This helps in making the final decision on whether records are likely duplicates.
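The combination and threshold steps can be illustrated with a minimal sketch. The following Python snippet shows naive Bayesian chaining of per-attribute probabilities and a threshold decision; the attribute probabilities and the threshold value are made-up examples, not the actual CDQ implementation.

```python
def combine_probabilities(probabilities):
    """Chain per-attribute probabilities with naive Bayesian inference.

    Each value is the probability that the two records represent the same
    entity, judged from a single attribute comparison alone.
    """
    p_same, p_diff = 1.0, 1.0
    for p in probabilities:
        p_same *= p
        p_diff *= (1.0 - p)
    return p_same / (p_same + p_diff)

# Hypothetical per-attribute probabilities for one candidate pair:
# strong name match, matching city, diverging zip code.
per_attribute = [0.9, 0.6, 0.3]

score = combine_probabilities(per_attribute)
THRESHOLD = 0.5  # pairs scoring above this are treated as duplicates
print(round(score, 3), score > THRESHOLD)  # -> 0.853 True
```

With these example values, the strong name match outweighs the diverging zip code, so the pair is still judged a likely duplicate.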

Why is efficient matching of business partner data challenging?
Matching challenges:
- Typical challenges when matching business partner names
- Typical challenges when matching business partner addresses
CDQ Lookup: Finding matches in connected data sources
Search for candidates
Data that is locally stored in the CDQ infrastructure, i.e. data from sources that provide data dumps, is indexed. Based on the input data, a fuzzy search is made on the index to identify potential candidates. Additionally, data sources that are only connected remotely (e.g. the European VAT Information Exchange System) are queried with the input data, and the results are considered as additional candidates.
Candidates are records that could potentially represent a duplicate, i.e. the candidate potentially represents exactly the same entity as the one described by the input data. They are a highly fuzzy selection from the total number of records in a database.
The number of candidates can be customized. The higher the number of candidates, the less likely it is that a fitting record is missed in subsequent steps. However, the more candidates there are, the lower the performance of the service, because in subsequent steps each candidate has to be evaluated with costly operations (cleaning, comparing). Setting the number of candidates to 1 would result in a speedy response; however, if this single candidate found by the fuzzy search does not represent the expected entity, you do not get a result, even though there might be additional candidates that would fit. On the other hand, if you set the number of candidates to 500, it is likely that the list of candidates comprises the fitting candidate (if one is available in the data sources at all), but the service execution is significantly less performant, because 500 candidates have to be analyzed.
Comparing candidates
For comparing a candidate against the searched entity represented by the input data, individual attributes are compared pair-wise with each other. Before comparing, a set of cleaners is applied to both strings; the comparison then takes place leveraging one of the available string comparison algorithms. The comparison results in a probability that is to be interpreted as: "based on just this single attribute, we judge the probability that both records represent the identical entity". Obviously, if just one single attribute is compared, this probability is not very meaningful on its own. Leveraging Bayesian inference, all individual probabilities are "chained" to receive a meaningful overall result.
The attributes that are compared depend on the particular data source and, of course, on the available input. E.g. if no street is given in the input, then the street does not provide any information on whether two records represent the same entity. Thus, missing attributes are always considered as a 50% probability that the records represent the identical entity and a 50% probability that they do NOT.
For each string comparison a high and low confidence value is defined:
- High = 0.9 (means, if both strings are evaluated to be identical, we are 90% sure that the overall record is a duplicate, i.e. represents the identical real-world entity)
- Low = 0.3 (means, if both strings are evaluated not to be identical, we are only 30% confident that the records are duplicates)
The high and low confidences work as a “floor” and “ceiling”: if the similarity were 0.2, the value considered in the calculation would still be 0.3, whereas an exact similarity of 1.0 would be capped at 0.9.
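A minimal sketch of this behaviour, assuming the per-attribute probability is simply the comparator similarity clamped between the configured low and high values, with missing attributes defaulting to 0.5 (the engine's actual mapping may differ):

```python
def attribute_probability(similarity, high=0.9, low=0.3):
    """Map a comparator similarity (0.0-1.0) to a per-attribute probability,
    clamped between the configured low and high confidence values."""
    if similarity is None:          # attribute missing in one of the records
        return 0.5                  # no evidence in either direction
    return min(max(similarity, low), high)

print(attribute_probability(0.2))   # -> 0.3 (floored)
print(attribute_probability(1.0))   # -> 0.9 (ceiled)
print(attribute_probability(None))  # -> 0.5 (missing attribute)
```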
Which strings are to be compared, using which comparators, after having applied which cleaners, and considering which high and low probabilities, is defined in matching configurations. For each data source there is a dedicated matching configuration reflecting the particularities of each country and source. If no special configuration is used, a default configuration is applied.
In general, the Matching Configuration defines the following (see the illustrative sketch after this list):
- Search
  - Defines how the search for candidates is performed by specifying which attribute values are used for the search. Additionally, it defines how many candidates the search may return at maximum (see previous section). Note: The higher the number of candidates, the higher the execution time of the matching process, as more candidates have to be assessed.
- Assessment
  - which attributes are used for assessing the similarity between the given and candidate record
  - which cleaners should be applied before comparing the attribute values
  - which comparator should be applied for comparing the attribute values
  - which probabilities per attribute comparison are employed for calculating the confidence whether the given and candidate record represent the identical entity
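Conceptually, a matching configuration can be pictured like the following Python sketch. All field names and values here are hypothetical and purely illustrative; they do not reflect the actual CDQ configuration schema.

```python
# Hypothetical structure of a matching configuration (illustrative only).
matching_configuration = {
    "search": {
        "attributes": ["name", "postCode", "locality"],  # used for the fuzzy candidate search
        "maxCandidates": 50,                             # trade-off: recall vs. execution time
    },
    "assessment": [
        {
            "attribute": "name",
            "cleaners": ["LegalFormCleaner", "LowerCaseNormalizeCleaner"],
            "comparator": "JaroWinklerComparator",
            "high": 0.9,   # confidence if the values are judged identical
            "low": 0.3,    # confidence if the values are judged different
        },
        {
            "attribute": "postCode",
            "cleaners": ["DigitsOnlyCleaner"],
            "comparator": "ExactComparator",
            "high": 0.8,
            "low": 0.4,
        },
    ],
}
```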
Cleaners
Cleaners transform or normalize data before it is effectively compared. So a cleaner's job is to make comparison easier by removing from data values all variations that are not likely to indicate genuine differences. For example, a cleaner might strip everything except digits from a zip code. Or, it might normalize and lowercase addresses. Or translate dates into a common format.
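As an illustration, simplified Python versions of two of the cleaners described below could look like this (minimal sketches, not the actual implementations):

```python
import re
import unicodedata

def digits_only(value: str) -> str:
    """Keep only digits, e.g. to compare post codes."""
    return re.sub(r"\D", "", value)

def lower_case_normalize(value: str) -> str:
    """Lowercase, trim, collapse whitespace and strip accents."""
    value = " ".join(value.lower().split())
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(digits_only("CH-9000"))                   # -> "9000"
print(lower_case_normalize("  Café  MÜLLER "))  # -> "cafe muller"
```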
Cleaner | Description |
---|---|
Attribute in Attribute Cleaner (Matching) | This cleaner removes the value of one attribute (e.g. the locality) from a selected attribute, e.g. "Germany" in "Company (Germany) AG". Used especially for removing the local or international name from the business partner name. |
Country Cleaner (Matching) | |
Digits-only Cleaner (Matching) | Removes everything which is not a digit, e.g. to compare post codes. |
Legal Form Cleaner (Matching) | Special cleaner for business partner names. The cleaner identifies a legal form in the input string and keeps only the part BEFORE the legal form. To recognize legal forms reliably, the cleaner needs some country information. For example, in "CDQ AG Factory St. Gallen", AG is identified as the legal form and only "CDQ" is used for matching. |
Lower-case Normalize Cleaner (Matching) | Most widely used cleaner. It lowercases all letters, removes whitespace characters at beginning and end, and normalizes whitespace characters in between tokens. It also removes accents, e.g. turning é into e, and so on. |
Non-character Cleaner (Matching) | Removes any characters that are not Latin characters, including numbers. |
Phone Number Cleaner (Matching) | Cleaner for international phone numbers. It assumes that it can get the same phone number in forms like 0047 55301400 , +47 55301400 , 47-55-301400 , or +47 (0) 55301400 . |
Punctuation cleaner (Matching) | |
Replace Cleaner (Matching) | Replaces strings by other strings. Patterns may also comprise regular expressions, and character case can be ignored. To use this cleaner, you have to define the cleaner as a separate object; an illustrative sketch is shown after this table. Patterns and replacements have to be provided as a JSON string. p: pattern to replace; the string may comprise a regular expression to define a group of strings to replace, e.g. 0{1,5} for a sequence of 1 to 5 "0" characters. r: replacement; this string is used to replace the pattern and may be empty to just remove the pattern. i: true to ignore the case of the given pattern, else false. isRegex: true means that the pattern is treated as a regex pattern; if false, the pattern is treated as a string literal. If not defined, the default value is false. |
Strip Non-text Characters Cleaner (Matching) | This cleaner strips control characters 0x00-0x1F and 0x7F-0x9F and special symbols in the range 0xA1 to 0xBF. |
Trim Cleaner (Matching) | This cleaner trims whitespace characters at the beginning and end of the input string. |
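For the Replace Cleaner, the pattern definitions could be provided along the following lines. This is an illustrative sketch based only on the documented fields p, r, i and isRegex; the surrounding cleaner object and how it is referenced in a matching configuration are not shown.

```python
import json

# Illustrative only: one replacement pattern that removes sequences of one to
# five "0" characters, ignoring character case. Patterns and replacements are
# provided as a JSON string, as described above.
replace_patterns = json.dumps([
    {"p": "0{1,5}", "r": "", "i": True, "isRegex": True},
])
print(replace_patterns)
```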
Comparators
A Comparator can compare two string values and produce a similarity measure between 0.0 (meaning completely different, or better said: based on this attribute alone, the probability that the two records represent the identical entity is 0) and 1.0 (meaning exactly equal, or better said: based on this attribute alone, it is 100% certain that the two records represent the identical entity). These are used because we need something better than simply knowing whether two values are the same or not. Also, different kinds of values must be compared differently, and comparison of complex strings like names and addresses is a whole discipline in itself.
Comparator | Description |
---|---|
Exact Comparator (Matching) | Just reports 0.0 if the values are not equal and 1.0 if they are. |
Geoposition Comparator (Matching) | Compares two geographic positions given by coordinates by the distance between them along the earth's surface. It assumes the parameters are of the form 59.917516,10.757933 , where the numbers are degrees latitude and longitude, respectively. The computation simply assumes a sphere with a radius of 6371 kilometers, so no particular geodetic model is assumed. WGS84 coordinates will work fine, while UTM coordinates will not work. See the Duke documentation for details, e.g. how to use the max-distance parameter. |
Jaro Winkler Comparator (Matching) | Jaro–Winkler distance, which has been found to be the best available general string comparator for deduplication. Use this for short strings like given names and family names; it is not so good for longer, general strings. The Jaro-Winkler comparator is an adaptation of the Jaro distance. The Jaro distance measures the number of characters that are common between two strings. Only common characters whose positions differ by no more than half the length of the longer string are considered. Then it is determined how many of the common characters are at the same position in both strings. The similarity is then calculated from the transpositions required for the set of common characters (Herzog et al., 2010, p. 540; Jaro, 1989). Winkler's adaptation introduces a correction factor, which increases the similarity if both strings share a common prefix. Like the Jaro distance, the measure was developed by Winkler for comparing names in US census data. Thus, the Jaro-Winkler distance is particularly suitable for name fields (Cohen, Fienberg, Ravikumar, & Fienberg, 2003, p. 74; Winkler, 1988). |
Levenshtein Comparator (Matching) | Most widely used fuzzy comparator. Uses the Levenshtein distance to compute the similarity. Basically, it measures the number of edit operations needed to get from string 1 to string 2 (see the illustrative sketch after this table). |
Longest Common Substring Comparator (Matching) | This comparator does not merely find the longest common substring, but does so repeatedly down to a minimal substring length. See [1] for details. |
Metaphone Comparator (Matching) | Compares field values using Metaphone. |
Soundex Comparator (Matching) | Compares field values using Soundex. |
Weighted Levenshtein Comparator (Matching) | Based on the Levenshtein comparator, with the difference that substrings can be weighted and given preference over the rest of the string. A typical scenario is a street value holding the street name and house number together. In such a case, the house number is slightly more important than the name, so you can assign a higher weight (preference) to digits, which leads to better matching results. |
q-Gram Comparator (Matching) | Uses n-grams of the field values to calculate their similarity. It is similar to Levenshtein, but a bit more eager to consider strings the same, and it does not care about the order of tokens. So for strings consisting of tokens that may be reordered (e.g. "Hotel Lindner Hamburg" and "Lindner Hotel Hamburg") it may be a better alternative than Levenshtein. It may be further configured by the q parameter to specify the size of the n-grams. The default size is 3, which is fine for most business partner use cases. |
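To make the Levenshtein-based comparators more concrete, here is a small illustrative Python function that turns the edit distance into a similarity between 0.0 and 1.0; the engine's actual normalization and weighting may differ.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity = 1 - edit_distance / length of the longer string."""
    if not a and not b:
        return 1.0
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

print(levenshtein_similarity("Muller AG", "Mueller AG"))  # -> 0.9
```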
Decision: Matching score threshold
Each match found when searching for company data comes with an overall matching score. The higher the score, the higher the confidence that the found record represents the actually searched entity. By default, all matches with a score greater than 0.5 are returned. A score of 0.5 is to be interpreted as a 50% confidence that the record is the right one and a 50% confidence that the record represents a different entity. The higher you set the threshold, the fewer matches you will receive, but the remaining ones come with a higher confidence.
Duplicate Guard - Consolidation of identified duplicates
The Duplicate Guard allows you to define how records in a duplicate matching group (entities representing the same real-world entity) are consolidated into a "best guess" or "golden" record.
Consolidating a group of records is done in two steps:
- "Base Record Selection"
- "Consolidated Record Creation"
Each step can be configured by defining rules that determine what should be done. The default behaviour if no configuration is provided is:
- "Base Record Selection" is skipped
- "Consolidated Record Creation" is done by choosing for each attribute the value that occurs most often
Base Record Selection
From a list of records, a base record can be determined by using a binary decision tree (see the illustrative sketch after the examples below).
Each node has a condition in the form of a rule that divides the group into the records that fulfil the condition and the ones that do not.
The tree and its nodes are evaluated from top to bottom and left to right.
This gives the rules an implicit priority from top to bottom and left to right.
- Only one branch will be evaluated further
- Possible decisions:
- Rule 1
- Rule 1 → Rule 2
- Rule 1 → Rule 3
- Rule 1 → Rule 3 → Rule 4
- Rule 1 → Rule 3 → Rule 5
A rule is defined by:
Name | Description | Is Mandatory? |
---|---|---|
type | Defines the type of the rule to be applied | YES |
jsonPath | Path to the record attribute that should be checked by the rule | YES |
valueType | Used if the value defined by the json path needs to be cast | NO |
expectedValues | Constant values against which the record attribute is checked, in case the rule type requires it | NO |
Type of rules to be used as condition in a node are:
- HAS_VALUE
- HAS_NOT_VALUE
- HAS_VALUE_IN
- HAS_NOT_VALUE_IN
- HAS_VALUE_LIKE
- HAS_NOT_VALUE_LIKE
- HAS_HIGHEST_VALUE
- HAS_LOWEST_VALUE
- IS_MARKED
- IS_NOT_MARKED
- IS_MOST_COMPLETE
- HAS_HIGHEST_MATCH_SCORE
Examples
In case a record must have no value in a certain attribute:
"rule": { "type": "IS_NOT_MARKED", "jsonPath": "DEL_FLAG" }
In case a record must have a certain value in an attribute:
"rule": { "type": "HAS_VALUE", "jsonPath": "KTOKD", "expectedValues": [ "ZDFO" ] }
In case a record should have the smallest value among the group in a certain attribute:
"rule": { "type": "HAS_LOWEST_VALUE", "jsonPath": "External Id", "valueType": "NUMBER" }
Consolidated Record Creation
The actual consolidation of all the records is done in four steps:
- Exclude/Filter unwanted records
- Protect attributes from the base record that should not be overwritten
- Add/Replace attributes in the golden record with attributes from the other records based on the defined rules
- Apply default handling for attributes for which no rule was defined in the step before (MAJORITY; a small sketch of this default follows below)
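A minimal sketch of the default MAJORITY handling, assuming records are flat dictionaries; the real consolidation additionally honours exclusion rules, protected attributes, and the enhancement rules described below.

```python
from collections import Counter

def consolidate_by_majority(records):
    """For each attribute, pick the value that occurs most often in the group."""
    golden = {}
    attributes = {key for record in records for key in record}
    for attribute in attributes:
        values = [r[attribute] for r in records if r.get(attribute) not in (None, "")]
        if values:
            golden[attribute] = Counter(values).most_common(1)[0][0]
    return golden

group = [
    {"name": "CDQ AG", "city": "St. Gallen"},
    {"name": "CDQ AG", "city": "Sankt Gallen"},
    {"name": "CDQ", "city": "St. Gallen"},
]
print(consolidate_by_majority(group))  # e.g. {'name': 'CDQ AG', 'city': 'St. Gallen'}
```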
Excluding Records
The use case for this step is to exclude records where the quality of the data is questionable, for example because the record is already marked for deletion or wasn't updated for a long time. Available rule types are:
- HAS_VALUE
- HAS_NOT_VALUE
- HAS_VALUE_IN
- HAS_NOT_VALUE_IN
- HAS_VALUE_LIKE
- HAS_NOT_VALUE_LIKE
- IS_MARKED
- IS_NOT_MARKED
Example
"recordExclusionRules": [ { "type": "IS_MARKED", "jsonPath": "DEL_FLAG" } ]
Protecting Attributes
This is an optional step that only makes sense in case a base record has been selected. A typical use case is that a company wants to keep one record of the group in the system and therefore wants to keep (meta)data only relevant for this particular record (e.g. the ID, date of creation, ...). Technically it is just a list of attribute names that will be protected from any further alteration.
Example
"protectedAttributes": [ "Storage ID" "Data Source ID", "External Id", "ERDAT" ]
Apply Attribute Enhancement Rules
This is the most important and "powerful" step in the GR creation flow. It allows simple manipulations like:
- removing a certain attribute
- setting a value from a record that was flagged
- or setting the majority value among all records
Sometimes more complex logic is needed. For this reason, a chain of constraints (same as selection or exclusion rules) can be defined to filter for fitting records. The structure of an AttributeEnhancementRule is the following:
Name | Description | Is Mandatory? |
---|---|---|
attributes | List of attributes that are affected by the rule | YES |
constraintChain | A chain of rules (same as selection rules or exclusion rules) to filter for fitting records | |
creationType | Allowed values are: | NO |
creationConstant | Define a constant to be set as the value for an attribute | NO (only in combination with creationType=CONSTANT) |
executionType | Describes what is done in case the constraints are fulfilled. Allowed values are: | YES |
fallbackExecutionType | In case the constraints are not met, this is the option to be executed (instead of executionType). Allowed values are: | NO |