FuzzyDupes:NET Assembly Documentation

Fast Duplicate Search in DataTable Data

Library: KS.FuzzyDupesNet.dll (DotNet 2.0/4.0 Assembly)
Version: 6.x/7.x

Dependencies: System/none

 

Interface


General Information

The FuzzyDupes:NET Assembly enables you to find similar records in database record data.

The data is processed in memory, the control implements optimized data structures in c# for this purpose. The algorithms used are universal for any string data. Adress data can be processed as any other data as well.

There is one main function that receives a System.Data.DataTable and returns another DataTable with the results. Some parameters can adjust the search.


Download

You can always find the current version at /download/FuzzyDupesNETSetup.exe

Different versions will have the same download filename.


License

Disclaimer: FuzzyDupes:NET Assembly is provided "as is" with no warranty of any kind.

FuzzyDupes:NET Assembly is not Freeware. The Demo Version shows an about screen prior to every search. With a valid license, the about screen is never shown.

The Demo Version is for testing purposes only and must not be used in a productive environment..

More information about licensing the library can be found at the FuzzyDupes:NET product page. Or contact info@kroll-software.ch


Installation

  1. Copy the file KS.FuzzyDupesNet.dll and \de\KS.FuzzyDupesNet.resources.dll to your harddrive
  2. Add a dependency to this assembly into your DotNet 2.0 project (VB.NET, C#, or whatever..)
  3. Check Local Copy and set Specific Version to false

Methods

AboutBox()

Return Value: void
Parameters: none or [ParentForm]
 
Shows the AboutBox.

 

Reset()

Return Value: void
Parameters: none
 
Resets the settings for the next search. Call this function prior to every search.

 

Cancel()

Return Value: void
Parameters: none
 
Cancels a long running process. You can call this in a Progress-Event.

 

SetColumnOption()

Set options for each column

Return Value: void
Parameters: int ColumnIndex, bool Cluster, bool DupeSearch, double Weight, bool CompareNull, string ColumnName[, int ImportColumnIndex]
 
Set the Options for your Variant-Array.

  • ColumnIndex is zero-based (first column has the index 0)
  • Cluster: Include this column in the cluster search (default = false)
  • DupeSearch: Include this column in the duplicate search (default = false)
  • Weight: Weight for this column relative to other weights. Values 0 - 1 are weighted, Values > 1 require identical value (no fuzziness), Values < 0 require a different value (no fuzziness). (default = 0.5)
  • CompareNull: NULL values in this column are compared (default = false*)
  • ColumnName: Name of this column
  • ImportColumnIndex **: Mapping to the column index in the second table for FuzzyMatch() and FuzzyMerge()

Call this prior to DupeSearch(), FuzzyMatch() and FuzzyMerge()

*) Important note: Set CompareNull to 'true' for columns that are filled with data in (nearly) all rows (e.g. Last Name, Street, ZIP, City) and set CompareNull to 'false' for other columns which may contain NULL values in many rows (e.g. First Name, Phone Number, ...)

**) FuzzyMatch() and FuzzyMerge() require a mapping and so a valid ImportColumnIndex set for each column

 

DupeSearch()

Search for duplicate values in a System.Data.DataTable

Return Value: System.Data.DataTable or null
Parameters: System.Data.DataTable T, double Threshold, double ClusterThreshold, DupeReturnResults ReturnResults, bool ShowProgress[, System.Windows.Forms.Form ParentForm]

  • DataTable T: The data table to perform the search in
  • Threshold: Threshold option for Duplicate Search (default = 90.0)
  • ClusterThreshold: Adjusts the Cluster-Size (default = 0.666667)
  • DupeReturnResults: DRR_DUPES|DRR_ALL - Return Dupes only, Return all records (default=DRR_DUPES)
  • ShowProgress: Show a Progress dialog. If set to false, the control fires progress events so that you can implement your custom dialog. (default = true)

The returned DataTable has an additional column containing GUIDs for each Record. These GUIDs are identical for similar records.

 

FuzzyMerge()

Merge two DataTables with fuzzyness

Return Value: System.Data.DataTable or null
Parameters: System.Data.DataTable T, System.Data.DataTable TMerge, double Threshold, double ClusterThreshold, MergeReturnResults ReturnResults, bool ShowProgress[, System.Windows.Forms.Form ParentForm]

This method requires a ImportColumnIndex set for each column with SetColumnOptions()

  • DataTable T: first data table
  • DataTable TMerge: second data table to merge
  • Threshold: Threshold option for Duplicate Search (default = 90.0)
  • ClusterThreshold: Adjusts the Cluster-Size (default = 0.666667)
  • ReturnResults: MRR_IMPORT|MRR_ALL|MRR_DUPES|MRR_CLEANMERGED - return new records only | return all records | return Dupes only | return the merged result without any dupes (default=MRR_IMPORT)
  • ShowProgress: Show a Progress dialog. If set to false, the control fires progress events so that you can implement your custom dialog. (default = true)

The returned DataTable has up to 2 additional columns (depending on ReturnResults)

  • GUIDs for each Record
  • Source of record (0=T, 1=TMerge)

These GUIDs are identical for similar records.

 

FuzzyMatch()

Matches two System.Data.DataTable with fuzzyness,
positive/negative match between two lists

Return Value: System.Data.DataTable or null
Parameters: System.Data.DataTable T, System.Data.DataTable TMatch, double Threshold, double ClusterThreshold, MatchReturnResults ReturnResults, bool ShowProgress, System.Windows.Forms.Form ParentForm

This method requires a ImportColumnIndex set for each column with SetColumnOptions()

  • DataTable T: first data table
  • DataTable TMatch: second data table to match with
  • Threshold: Threshold option for Duplicate Search (default = 90.0)
  • ClusterThreshold: Adjusts the Cluster-Size (default = 0.666667)
  • ReturnResults: MAR_ALL|MAR_POSITIVE|MAR_NEGATIVE - return all records | return positive matches | return negative matches (default=MAR_ALL)
  • ShowProgress: Show a Progress dialog. If set to false, the control fires progress events so that you can implement your custom dialog. (default = true)

The returned DataTable has 1 additional column containing 0=No-Match, or Index (1 based) from TMatch where a match was found

 

ExactMatch()

Matches two DataTables exact (without fuzzyness) for a given column

Return Value: System.Data.DataTable or null
Parameters: System.Data.DataTable T, System.Data.DataTable TMatch, int MatchColumnIndex, ExactMatchReturnResults ReturnResult

  • DataTable T: first data table
  • DataTable TMatch: second data table to match with
  • MatchColumnIndex: column index where data must match
  • ReturnResults: MAR_POSITIVE|MAR_NEGATIVE - match positive / negative

 

DeleteDupes()

Delete duplicate values from a DataTable (result from another method) depending on a FuzzyDupesID (GUID)

Return Value: System.Data.DataTable or null
Parameters: System.Data.DataTable T, int FIDColumnIndex, int SortColumnIndex, int MaxColumn, DeleteOrders DeleteOrder, DeleteReturnResults ReturnResults

  • DataTable T: the data
  • FIDColumnIndex: Zero-based ID of the FuzzyDupesID (GUID) Column (default = -1)
  • SortColumnIndex: index of the column to sort the results
  • MaxColumn: only return MaxColumnIndex number of columns. Used to cut the FuzzyDupesID (GUID) from the result (default = -1)
  • DeleteOrder: DO_KEEP_FIRST|DO_KEEP_LAST - remove all dupes except the first / last (default=DO_KEEP_FIRST)
  • ReturnResults: DER_CLEAN|DER_DELETED - cleaned result / deleted records

Use this function, to get a clean (dupe free) search result:

  1. Search for dupes with DupeSearch() and ReturnResults=DRR_ALL
  2. Call this function to remove the dupes from the result

 

Canceled

This property returns true, when a long running process was canceled.

Return Value: boolean
Parameters: none

 

 


Working with Normalization Rules

The Method SetNormalizeOption() sets a normalization rule to a column.

AddNormalizeRule() adds a search/replace pair to the current normalize list

You can use the class NormalizeList to create new, load and save lists for normalization.

--- End ---