Duplicate Identification

From CDQ
Name: Duplicate Identification (sort rank: 110, product: Data Quality Services)
Short description: Duplicate matching and consolidation with detailed, feature-rich configurations.
Description: Duplicate matching compares all records of a given set of custom databases with each other, identifies similar records, and groups the "best matches" into matching groups. Obtaining a duplicate report takes three steps: (1) select the custom databases to be analyzed and a matching configuration that configures the matching algorithm, (2) start a matching job and wait for the result (i.e. record links with a similarity score), and (3) generate a duplicate report, optionally with cleansed golden records for each matching group.

The general duplicate matching process can be divided into three major steps:

Duplicate Identification Overview.png

Attribute Selection

  • Select the attributes that qualify for establishing the identity of a business partner, i.e. the attributes best suited for identifying records that represent the same entity.
    • Define a set of attributes by which business partners can be distinguished from each other (i.e. attributes that establish the identity of an organization). Some attributes are well qualified on their own (e.g. equal names or matching business identifiers provide strong evidence of a duplicate), while others are only suitable in combination (e.g. the same city alone provides very little evidence of a duplicate, as many organizations may be located in the same locality).
    • Similar name attributes provide strong evidence of a duplicate. However, comparing names requires careful consideration of several aspects. For example, it has to be decided whether legal forms are considered, how acronyms (e.g. “BMW”) are handled, and which components the name should consist of. The last aspect is particularly important, because comparing legal entity names with organizational information (e.g. division names) or with brand or trade names rarely yields reliable results.
    • Address information is another strong indicator for the existence of duplicate business partner data. However, address attributes are only suitable for this purpose in combination with other attributes. In addition, addresses are characterized by a wide variety of possible representations.
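
The idea that some attributes carry strong evidence on their own while others only count in combination can be illustrated with a weighted score. This is a minimal sketch, not the CDQ matching algorithm: the attribute names, the weights, and the exact-equality comparison are assumptions chosen for illustration.

```python
# Hypothetical weights: how much evidence an equal value contributes.
# These numbers are illustrative assumptions, not CDQ defaults.
WEIGHTS = {"name": 0.6, "tax_number": 0.3, "city": 0.1}


def evidence_score(a: dict, b: dict, weights: dict = WEIGHTS) -> float:
    """Sum the weights of all attributes that are present and equal in both records."""
    score = 0.0
    for attribute, weight in weights.items():
        left, right = a.get(attribute), b.get(attribute)
        if left and right and left == right:
            score += weight
    return score


# Same city only: weak evidence.
same_city_only = evidence_score(
    {"name": "Alpha GmbH", "city": "Hamburg"},
    {"name": "Beta AG", "city": "Hamburg"},
)

# Same name and same city: much stronger evidence.
same_name_and_city = evidence_score(
    {"name": "Alpha GmbH", "city": "Hamburg"},
    {"name": "Alpha GmbH", "city": "Hamburg"},
)
```

With these assumed weights, a shared city alone stays far below any sensible duplicate threshold, whereas a shared name plus a shared city scores much higher.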

Harmonization

  • Harmonize the data and standardize the selected values. This is usually done by applying temporary cleaners that manipulate data only for the candidate search and comparison.
    • Cleaners harmonize and standardize the input data, e.g. when one data source uses abbreviations whereas another uses full names only. The selection of attributes and the required harmonization depend on the use case for the identification of duplicates. For example, consolidating records that represent parts of one and the same legal entity requires a different treatment than aggregating data objects that belong to the same site of a business partner. In the first case, D-U-N-S numbers are a good indicator, as are company register IDs, tax numbers, or names. In the second case, tax numbers or other legal entity identifiers are poorly suited, because the focus must be on address attributes: a business partner may have many locations, but there is, for example, only one legal entity with one tax number. Tax numbers and the like can therefore provide evidence only to a certain extent.
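
A temporary cleaner of this kind might look as follows. This is a minimal sketch under stated assumptions: the function `clean_name`, the legal-form list, and the abbreviation map are hypothetical and far smaller than anything a production configuration would use. The cleaner returns a comparison key and leaves the stored record untouched.

```python
import re
import unicodedata

# Illustrative assumptions: a tiny legal-form list and abbreviation map.
LEGAL_FORMS = {"ag", "gmbh", "ltd", "inc", "sa"}
ABBREVIATIONS = {"inst.": "institute"}


def clean_name(value: str) -> str:
    """Return a temporary comparison key for a business partner name."""
    # Normalize Unicode compatibility forms and remove case differences.
    text = unicodedata.normalize("NFKC", value).casefold()
    tokens = []
    for token in text.split():
        token = ABBREVIATIONS.get(token, token)  # expand known abbreviations
        token = re.sub(r"[^\w]", "", token)      # drop punctuation
        if token and token not in LEGAL_FORMS:   # drop legal forms
            tokens.append(token)
    return " ".join(tokens)


key_upper = clean_name("BAYER AG")
key_mixed = clean_name("Bayer AG")
key_abbreviated = clean_name("Inst. of Information Management")
```

Whether legal forms should be removed at all depends on the use case; the sketch removes them only to show how a cleaner can make differently written names comparable.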

Search & Compare

  • Match the input data by searching for and comparing business partners using the selected attributes.
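
The search-and-compare step, together with the grouping of record links into matching groups, can be sketched as below. Everything here is an assumption made for illustration: the sample records, the use of the city as a blocking key for the candidate search, `difflib.SequenceMatcher` as the comparator, the threshold of 0.75, and the union-find grouping. It is not the CDQ implementation.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical, already harmonized records.
records = {
    1: {"name": "bayer", "city": "leverkusen"},
    2: {"name": "baier", "city": "leverkusen"},
    3: {"name": "lindner hotel", "city": "hamburg"},
}
THRESHOLD = 0.75  # assumed similarity cut-off


def similarity(a: str, b: str) -> float:
    """Comparator: similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def find_links(records: dict, threshold: float = THRESHOLD) -> list:
    """Candidate search by city, then pairwise name comparison."""
    blocks = {}
    for record_id, record in records.items():
        blocks.setdefault(record["city"], []).append(record_id)
    links = []
    for ids in blocks.values():
        for left, right in combinations(ids, 2):
            score = similarity(records[left]["name"], records[right]["name"])
            if score >= threshold:
                links.append((left, right, score))
    return links


def matching_groups(links: list) -> list:
    """Group linked records into connected components (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for left, right, _ in links:
        parent[find(left)] = find(right)
    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


links = find_links(records)
groups = matching_groups(links)
```

Restricting comparisons to records that share a blocking key avoids comparing every record with every other one, at the cost of missing duplicates whose key values differ.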
Release status: LIVE
Release status values (development progress or maturity of a product feature or a business capability):
  • EMPTY (0): No feature considered yet, just a rough idea for the capability.
  • IDEA (1): Just an idea, not yet designed in detail.
  • DESIGN (2): Software design ready, development not yet started.
  • DEVELOPMENT (3): Software development in progress.
  • ALPHA (4): First functional release, in terms of a Minimal Viable Product (MVP).
  • BETA (5): Tested by selected users.
  • RC (6): Release candidate, fully tested, not yet used in production by many customers.
  • LIVE (7): Used in production by customers, fully monitored and supported.
  • DEPRECATED (-1): End of life planned, but still available.
  • EOL (-2): End of life, historic service, no longer available.
  • BROKEN (-3): Service was used in production but is currently not available. However, CDQ tries to repair or reactivate it.
Use cases: Analyze and Prepare a Business Partner Storage for Get Clean, Cleanse and Enrich a Business Partner Storage, Duplicate avoidance in the business partner data maintenance workflow, Iterative Duplicate Check, Link and Consolidate Data from Multiple Systems, Simple Duplicate Check
Apps
The following apps provide this capability:
Duplicate Matching
  • Duplicate Matching Video
Decision Log App
APIs
The following APIs provide this capability:

Features

Feature | Short description | Release status
Data Mirror Lookup | Look up business partner data in a data mirror, e.g. to identify potential duplicates during new record creation. | LIVE
Duplicate Consolidation | Consolidate a group of identified duplicate records into one surviving record based on a Duplicate Consolidation Configuration. | LIVE
Duplicate Detection | Identify potential duplicates in a given set of business partner data records. | LIVE
Duplicate Matching Configuration | Define how business partner records from a data set are compared by the duplicate analysis algorithm. | LIVE
Matching Cleaner Configuration | Cleaners transform or normalize data before it is compared by the duplicate analysis algorithm. | LIVE
Matching Comparator Configuration | Comparators compare data from the same attribute but different records and produce a match score. | LIVE
Matching and Consolidation Reports | Duplicate and record linkage reporting. | LIVE
Record Linkage | Identify identical records in two or more datasets. | LIVE
Duplicate Consolidation Configuration | Defines how records in a duplicate matching group are consolidated into a "best guess" or "golden" record. | BETA
Match Candidate Review | Manual review of matches proposed by duplicate detection or record linkage, and consideration of the reviews in subsequent matching runs. | BETA
Duplicate Monitoring | Continuous monitoring of business partner data in a Data Mirror for duplicates. | ALPHA
Duplicate Monitoring Configuration | Configure the matching configuration, the monitoring interval, and other options for continuous duplicate monitoring of business partner data in a Data Mirror. | IDEA


Why is efficient matching of business partner data challenging?

Matching challenges
Typical challenges when matching business partner names
  • The representation of the name is not standardized. While for some names abbreviations are used (e.g. “Inst. of Information Management”), others include legal forms (“ZF Friedrichshafen AG”) or are written in uppercase or lowercase letters only (“BAYER AG”).
  • The name is written in different scripts (e.g. Chinese, Cyrillic, and/or Latin characters).
  • The name includes misplaced information. For example, c/o information is added to the name instead of being stored in its dedicated data attribute.
  • In some cases, acronyms are used, while in other cases the full name is used (e.g. “BMW AG” vs. “Bayerische Motoren Werke AG”).
  • Misspellings may occur in various forms: characters added (e.g. “Bayern AG” instead of “Bayer AG”), characters omitted (e.g. “Byer AG”), characters replaced (e.g. “Baier AG”), or characters transposed (e.g. “Bayre AG”).
  • The order of name components is inconsistent (e.g. “Lindner Hotel Hamburg” vs. “Hotel Lindner in Hamburg”).
  • The name is structured differently across data models (e.g. a single attribute “business partner name” in one data model vs. “name 1-5” in SAP’s data model).
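
Two of the name challenges above, misspellings and inconsistent component order, can be made concrete with standard string measures. The sketch below is an illustration under assumptions, not the comparators used by the product: `osa_distance` is the optimal string alignment variant of edit distance (insertion, deletion, substitution, adjacent transposition), and `token_jaccard` is a simple order-insensitive token overlap. Both function names are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute, and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def token_jaccard(a: str, b: str) -> float:
    """Share of name tokens the two names have in common, ignoring order and case."""
    left, right = set(a.casefold().split()), set(b.casefold().split())
    if not left and not right:
        return 1.0
    return len(left & right) / len(left | right)


# Each misspelling from the list above is one edit away from "Bayer AG".
distances = [
    osa_distance("Bayer AG", variant)
    for variant in ("Bayern AG", "Byer AG", "Baier AG", "Bayre AG")
]

# Reordered components still share three of four distinct tokens.
overlap = token_jaccard("Lindner Hotel Hamburg", "Hotel Lindner in Hamburg")
```

The first example also shows the limits of a pure distance measure: “Bayern AG” is only one edit away from “Bayer AG” even though the two names may belong to different organizations, so such scores need to be combined with other attributes.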
Typical challenges when matching business partner addresses
  • Misspellings of cities or thoroughfares (see above).
  • No consistent use of abbreviations (e.g. “Lindenstr.” vs. “Lindenstrasse”).
  • Misplaced information (e.g. c/o information or building information is included in the thoroughfare, or the house number is sometimes included in the street name and sometimes placed separately).
  • Missing attributes (e.g. one system includes building information, another does not, or data was just not maintained).
  • Local names of cities vs. international or foreign-language names (e.g. “München” vs. “Munich”, “Milano” vs. “Mailand”).
  • Different scripts (e.g. Chinese characters, see above).
  • Semantic ambiguities for certain fields (e.g. different post codes available in Ireland: Eircode vs. GeoDirectory vs. Loc8 Code).
  • Post box addresses vs. street addresses (sometimes they are maintained separately, sometimes the street and post box address are maintained in one address data object).
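
Some of the address challenges above can be reduced by harmonizing addresses before comparison. The following is a minimal sketch under stated assumptions: the alias table, the regular expression for trailing house numbers, and the function names `normalize_city` and `split_thoroughfare` are hypothetical and cover only the examples mentioned in this section.

```python
import re

# Illustrative assumption: a tiny alias table mapping known variants to a local name.
CITY_ALIASES = {"munich": "münchen", "mailand": "milano", "milan": "milano"}


def normalize_city(city: str) -> str:
    """Map known international or foreign-language names to one local name."""
    key = city.strip().casefold()
    return CITY_ALIASES.get(key, key)


def split_thoroughfare(thoroughfare: str, house_number: str = "") -> tuple:
    """Return (street, house_number) regardless of where the number was stored."""
    text = thoroughfare.strip().casefold()
    # A trailing number such as "12" or "12a" is treated as the house number.
    match = re.match(r"^(.*?)[\s,]+(\d+\s?[a-z]?)$", text)
    if match and not house_number:
        text, house_number = match.group(1), match.group(2)
    # Expand the abbreviation "str." at the end of the street name.
    text = re.sub(r"str\.$", "strasse", text)
    return text, house_number.strip().casefold()


combined = split_thoroughfare("Lindenstr. 12")
separate = split_thoroughfare("Lindenstrasse", "12")
same_city = normalize_city("Munich") == normalize_city("München")
```

Both address variants yield the same street and house number pair, so a comparator sees identical values instead of two differently structured strings.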