Entity resolution
Entity resolution (also called identity resolution) determines whether different records represent the same real-world organization. In CDQ's data sharing context, this capability ensures that shared updates and validations refer to the correct entities, preventing duplication and confusion.
CDQ's matching engine applies a combination of deterministic identifiers and probabilistic similarity logic to detect duplicates, overlaps, and hidden relationships across datasets. This process allows customers to link their data mirror records to external references and shared records in the CDQ community. Reliable entity resolution is therefore a prerequisite for meaningful data sharing, enabling customers to recognize when another participant has already validated or corrected the same organization.
CDQ matching engine: Leveraging Bayesian inference for identity resolution
The CDQ matching engine is employed in all services whose goal is to find overlaps between business partner records in different datasets. The matching engine leverages a straightforward approach to identify duplicate records or entities within and across datasets. The method is rooted in the concept of entity resolution or identity resolution, which aims to determine whether different records represent the same real-world entity by comparing their attributes.
Breakdown of the approach
- Selection of relevant properties: Selecting attributes that are qualified for establishing the identity of a business partner, i.e. attributes that are best suited for identifying records that represent the identical entity. Out of all available attributes, a subset is chosen by which business partners can reliably be distinguished from each other.
While some attributes are well qualified for this (e.g. equal names provide strong evidence for the existence of a duplicate, as do certain business identifiers), others are only suitable in combination with others (e.g. the same city alone provides only very little evidence for a duplicate, as many organizations may be located in the same locality). Name comparison, however, requires careful consideration. For example, it has to be decided whether legal forms are to be considered, how to deal with acronyms (e.g. “BMW”), and which components the name should consist of. The latter aspect is quite important, as comparing legal entity names with organizational information (e.g. division names) or with brand or trade names hardly provides reliable results. Address information is another strong indicator for the existence of duplicate business partner data. However, address attributes are only suitable for this purpose in combination with other attributes, and addresses are characterized by a wide variety of possible representations.
- Fuzzy search for candidates: Employing a fuzzy index search (backed by separate, more complex configurations), potential candidates that may represent the same entity as the input entity are identified.
- Pairwise comparison and ad-hoc cleaning: For each pair of records, the chosen properties are compared on a one-to-one basis. This involves evaluating whether each attribute matches or differs between the two records.
Cleaners harmonize and standardize the input data, e.g. if one data source uses abbreviations, whereas another uses full names only. The selection of attributes and the required harmonization depend on the use case for the identification of duplicates. For example, the consolidation of records that represent parts of one and the same legal entity requires a different treatment than the aggregation of data objects that belong to the same site of a business partner. In the first case, D-U-N-S numbers are a good indicator, as well as company register IDs, tax numbers or names. In the second case, tax numbers, or any kind of legal entity identifier, are not really qualified, as the focus needs to be laid on address attributes (because a business partner may have many locations, but there is e.g. only one legal entity with one tax number). Tax numbers and the like therefore can only provide evidence to a certain extent.
- Bayesian Inference for probability estimation: Based on the comparison, a probability is assigned to each property indicating the likelihood that the two records refer to the same entity. For example, a match in names might be given a high probability of indicating the same entity, while a discrepancy in zip codes might lower that probability.
- Overall probability calculation: The probabilities for each property are combined using Bayesian inference to calculate an overall probability that the two records represent the same entity.
- Threshold: A threshold probability is set, above which two records are considered duplicates of the same entity. This helps in making the final decision on whether records are likely duplicates.
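The comparison, Bayesian combination, and threshold steps above can be sketched as follows. This is a minimal illustration in Python, not CDQ's actual implementation, and the per-attribute probabilities are invented:

```python
from functools import reduce

def combine(probabilities):
    """Naive Bayesian chaining of per-attribute probabilities:
    P = prod(p_i) / (prod(p_i) + prod(1 - p_i)).
    A value of 0.5 is neutral and leaves the result unchanged."""
    match = reduce(lambda acc, p: acc * p, probabilities, 1.0)
    no_match = reduce(lambda acc, p: acc * (1.0 - p), probabilities, 1.0)
    return match / (match + no_match)

# Invented per-attribute probabilities: strong name match (0.9),
# differing zip code (0.3), matching city (0.6).
overall = combine([0.9, 0.3, 0.6])

# Records are considered duplicates when the combined probability
# exceeds the configured threshold.
THRESHOLD = 0.5
is_duplicate = overall > THRESHOLD
```

Note how the differing zip code lowers the overall probability, but the strong name evidence still pushes the combined score above the threshold.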
Why is efficient matching of business partner data challenging?
Matching challenges fall into two groups:

- Typical challenges when matching business partner names
- Typical challenges when matching business partner addresses
CDQ lookup: Finding matches in connected data sources
Search for candidates
Data that is locally stored in the CDQ infrastructure, i.e. data sources that provide data dumps, is indexed. Based on the input data, a fuzzy search is performed on the index to identify potential candidates. Additionally, data sources that are only connected remotely (e.g. the European VAT Information Exchange System) are queried with the input data, and the results are considered as additional candidates.
Candidates are records that could potentially represent a duplicate, i.e. a candidate may represent exactly the same entity as the one described by the input data. They are a highly fuzzy selection from the total number of records in a database.
The number of candidates can be customized. The higher the number of candidates, the less likely it is that a fitting record is missed in subsequent steps. However, more candidates also lower the performance of the service, because each candidate has to be evaluated with costly operations (cleaning, comparing) in subsequent steps. Setting the number of candidates to 1 results in a speedy response, but if this single candidate found by the fuzzy search does not represent the expected entity, no result is returned even though additional fitting candidates might exist. Setting the number to 500, on the other hand, makes it likely that the list of candidates comprises the fitting one (if available in the data sources), but the service execution is significantly less performant, because 500 candidates have to be analyzed.
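As a toy illustration of this trade-off, Python's difflib can stand in for the real fuzzy index (the mini-index below is invented); the `n` parameter plays the role of the configurable candidate limit:

```python
import difflib

# Invented mini-index standing in for an indexed data source.
index = ["CDQ AG", "Siemens AG", "BMW AG", "CDO Inc", "CDQ GmbH"]

# n limits the number of candidates, cutoff controls the fuzziness.
# A small n is fast but may miss the fitting record; a large n is
# thorough but every candidate must later be cleaned and compared.
candidates = difflib.get_close_matches("CDQ", index, n=3, cutoff=0.3)
```

With `n=1` only the single closest string would be returned; if that one does not represent the searched entity, no result is produced even though other fitting candidates exist in the index.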
Comparing candidates
For comparing a candidate against the searched entity represented by the input data, individual attributes are compared pair-wise. Before comparing, a set of cleaners is applied to both strings; the comparison then takes place leveraging one of the available string comparison algorithms. The comparison results in a probability that is to be interpreted as: "based on just this single attribute, we judge the probability that both records represent the identical entity". Obviously, a probability based on a single attribute is of limited significance on its own. Leveraging Bayesian inference, all individual probabilities are "chained" to receive a meaningful overall result.
The attributes that are compared depend on the particular data source and, of course, on the available input. For example, if no street is given in the input, the street does not provide any information on whether two records represent the same entity. Thus, a missing attribute is always considered as a 50% probability that the records represent the identical entity and a 50% probability that they do not.
For each string comparison a high and low confidence value is defined:
- High = 0.9 (means, if both strings are evaluated to be identical, we are 90% sure that the overall record is a duplicate, i.e. represents the identical real-world entity)
- Low = 0.3 (means, if both strings are evaluated not to be identical, we are only 30% confident that the records are duplicates)
The high and low confidences work as a "floor" and "ceiling": if the similarity were 0.2, the low value considered in the calculation would still be 0.3, whereas an exact similarity of 1.0 would be capped at 0.9.
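One plausible reading of this flooring/ceiling behavior as code (a sketch, not CDQ's implementation):

```python
def attribute_probability(similarity: float,
                          low: float = 0.3,
                          high: float = 0.9) -> float:
    """Clamp a raw string similarity into the [low, high] confidence
    band: a similarity of 0.2 is floored to 0.3, an exact 1.0 is
    ceiled to 0.9."""
    return min(max(similarity, low), high)
```

Values between the two bounds pass through unchanged, so only extreme similarities are adjusted.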
Which strings are compared, which cleaners are applied beforehand, which comparators are used, and which high and low probabilities are considered is all defined in matching configurations. For each data source there is a dedicated matching configuration reflecting the particularities of each country and source. If no special configuration exists, a default configuration is applied.
In general, the matching configuration defines the following:
- Search
- Define how the search for candidates is performed by specifying which attribute values are used for the search. Additionally it is defined how many candidates the search could return at maximum (see previous section). Note: The higher the number of candidates, the higher the execution time of the matching process as more candidates have to be assessed.
- Assessment
- which attributes are used for assessing the similarity between the given and candidate record
- which cleaners should be applied before comparing the attribute values
- which comparator should be applied for comparing the attribute values
- which probabilities per attribute comparison are employed for calculating the confidence whether the given and candidate record represent the identical entity
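Putting these pieces together, a matching configuration might conceptually look like the following. All field names are invented for illustration and do not reflect CDQ's actual schema:

```python
# Hypothetical matching configuration; every key is illustrative only.
matching_config = {
    "search": {
        # attribute values used for the fuzzy candidate search
        "attributes": ["name", "country", "locality"],
        # maximum number of candidates (speed vs. recall trade-off)
        "maxCandidates": 50,
    },
    "assessment": [
        {
            "attribute": "name",
            "cleaners": ["LegalFormCleaner", "LowerCaseNormalizeCleaner"],
            "comparator": "JaroWinklerComparator",
            "high": 0.9,  # confidence if the cleaned values match
            "low": 0.3,   # confidence if they do not
        },
        {
            "attribute": "postCode",
            "cleaners": ["DigitsOnlyCleaner"],
            "comparator": "ExactComparator",
            "high": 0.8,
            "low": 0.4,
        },
    ],
}
```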
Cleaners
Cleaners transform or normalize data before it is compared. A cleaner's job is to make comparison easier by removing from data values all variations that are unlikely to indicate genuine differences. For example, a cleaner might strip everything except digits from a zip code, or normalize and lowercase addresses, or translate dates into a common format.
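Two of the cleaners listed below can be sketched in a few lines of Python (simplified illustrations, not CDQ's implementations):

```python
import re
import unicodedata

def digits_only(value: str) -> str:
    """Digits-only Cleaner: keep only digits, e.g. for post codes."""
    return re.sub(r"\D", "", value)

def lower_case_normalize(value: str) -> str:
    """Lower-case Normalize Cleaner: lowercase, trim, collapse inner
    whitespace, and remove accents (e.g. turning é into e)."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.lower().split())

digits_only("DE-20095")                   # "20095"
lower_case_normalize("  Café  MÜLLER ")   # "cafe muller"
```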
| Cleaner | Description |
|---|---|
| Attribute in Attribute Cleaner | This cleaner removes the value of one attribute (for example the locality) from another, selected attribute, e.g. "Germany" in "Company (Germany) AG". Used especially for removing local or international name parts from business partner names. |
| Country Cleaner | Removes country names from source. |
| Digits-only Cleaner | Removes everything which is not a digit, e.g. to compare post codes. |
| Legal Form Cleaner | Special cleaner for business partner names. The cleaner identifies a legal form in the input string and keeps only the part BEFORE the legal form. To recognize legal forms reliably, the cleaner needs country information. For example, in "CDQ AG Factory St. Gallen", "AG" is identified as the legal form and only "CDQ" is used for matching. |
| Lower-case Normalize Cleaner | Most widely used cleaner. It lowercases all letters, removes whitespace characters at the beginning and end, and normalizes whitespace characters in between tokens. It also removes accents (e.g. turning é into e). |
| Non-character Cleaner | Removes any characters that are not Latin characters, including numbers. |
| Phone Number Cleaner | Cleaner for international phone numbers. It assumes that the same phone number may appear in forms like 0047 55301400, +47 55301400, 47-55-301400, or +47 (0) 55301400. |
| Punctuation Cleaner | Removes punctuation marks from a given string. |
| Replace Cleaner | Replaces strings by other strings. Patterns may also comprise regular expressions, and character case can be ignored. To use this cleaner, you have to define the cleaner as a separate object. Patterns and replacements have to be provided as a JSON string. |
| Strip Non-text Characters Cleaner | This cleaner strips control characters 0–0x1F and 0x7F–0x9F and special symbols in the range 0xA1–0xBF. |
| Trim Cleaner | This cleaner trims whitespace characters at the beginning and end of the input string. |
Comparators
A Comparator can compare two string values and produce a similarity measure between 0.0 (completely different) and 1.0 (exactly equal). These are used because we need something better than simply knowing whether two values are the same or not. Also, different kinds of values must be compared differently, and comparison of complex strings like names and addresses is a whole discipline in itself.
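For illustration, a Levenshtein-based similarity of the kind used by the comparators below can be sketched as follows (a simplified version, not the configurable library components the engine actually uses):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming: the number of
    insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Map edit distance to a similarity between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```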
| comparator | description |
|---|---|
| Exact Comparator (Matching) | Just reports 0.0 if the values are not equal and 1.0 if they are. |
| Geoposition Comparator (Matching) | Compares two geographic positions given by coordinates based on the distance between them along the earth's surface. It assumes parameters of the form 59.917516,10.757933, where the numbers are latitude and longitude. Computation assumes a sphere with a radius of 6371 km (no geodetic model). WGS84 coordinates will work; UTM will not. |
| Jaro Winkler Comparator (Matching) | Jaro–Winkler distance, found to be one of the best general string comparators for deduplication. Use this for short strings like given names and family names. The Jaro distance measures common characters not more distant than half the longer string length, and the similarity accounts for transpositions of common characters. Winkler’s adaptation adds a correction factor that increases similarity for shared prefixes, making it well suited for names. |
| Levenshtein Comparator (Matching) | Most widely used fuzzy comparator. Uses Levenshtein distance to compute similarity, measuring the number of edit operations needed to get from one string to the other. |
| Longest Common Substring Comparator (Matching) | Finds the longest common substring repeatedly down to a minimal substring length. |
| Metaphone Comparator (Matching) | Compares field values using the Metaphone phonetic algorithm. |
| Soundex Comparator (Matching) | Compares field values using the Soundex phonetic algorithm. |
| Weighted Levenshtein Comparator (Matching) | Similar to the Levenshtein comparator but allows weighting of substrings (e.g. giving digits in addresses higher importance). Useful for street fields containing both street name and house number. |
| q-Gram Comparator (Matching) | Uses n-grams of field values to calculate similarity. It is similar to Levenshtein but more tolerant of reordered tokens, e.g. "Hotel Lindner Hamburg" and "Lindner Hotel Hamburg". Configurable by the q-parameter (default size is 3, sufficient for most business partner use cases). |
Decision: Matching score threshold
Each match when searching for company data comes with an overall matching score. The higher the score, the higher the confidence that the found record represents the actual searched entity. By default, all matches with a score greater than 0.5 are returned. A score of 0.5 is to be interpreted as a 50% confidence that the record is the right one and a 50% confidence that the record represents a different entity. The higher you set the threshold, the fewer matches you will receive, but the remaining ones come with a higher confidence.
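The threshold decision can be sketched as a simple filter (the record IDs and scores below are invented, not real service output):

```python
def accept_matches(matches, threshold=0.5):
    """Keep only matches scoring above the threshold, best first.
    `matches` is a list of (record_id, score) tuples."""
    return sorted((m for m in matches if m[1] > threshold),
                  key=lambda m: m[1], reverse=True)

accept_matches([("a", 0.92), ("b", 0.48), ("c", 0.61)])
# keeps ("a", 0.92) and ("c", 0.61); ("b", 0.48) falls below 0.5
```

Raising the threshold to, say, 0.9 would leave only the single high-confidence match.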