KSPhonet OCX Documentation
Fast Duplicate Search in Recordset Data
Library: KSPhonetLib (KSPhonet ActiveX Control module)
GUID: {BC5CFE93-56AE-11D4-80C6-0048546EA2D1}
Version: 5.7
Dependencies: none
General Information
Download
License
Installation
Interface
Sample Source Code
General Information
KSPhonet Control enables you to find similar records in database record data.
The data is processed in memory, the control implements optimized data structures in c++ for this purpose.
The algorithms used are universal for any string data. Adress data can be processed as any other data as well.
There is one main function that receives a Variant-Array returned from ADODB.Recordset.GetRows() and returns another
Variant-Array with the results. Some parameters can adjust the search.
Download
You can always find the current version at
http://www.kroll-software.de/download/fuzzydupescom.zip
Different versions will have the same download filename.
License
Disclaimer: KSPhonet-Control is provided "as is" with no warranty of any kind.
KSPhonet-Control is not Freeware.
The Demo Version returns a maximum of 50 records. With a valid license, the control returns all records found.
More information about licensing the control can be found at the
Kroll-Software website.
Or contact info@kroll-software.de
Installation
- Copy the file ksphonet.ocx to a directory with a short (DOS) pathname
- Open the windows command prompt
- Register the control with the following command
regsvr32.exe <Drive:Path\>ksphonet.ocx
Methods
AboutBox()
Return Value: void
Parameters: none
Shows the AboutBox.
Reset()
Return Value: void
Parameters: none
Resets the settings for the next search. Call this function prior to every search.
Cancel()
Return Value: void
Parameters: none
Cancels a long running process.
You can call this in a Progress-Event.
SetColumnOption()
Set options for each column
Return Value: void
Parameters: ColumnIndex AS Integer, Cluster AS Boolean, DupeSearch AS Boolean, Weight AS Double, CompareNull As Boolean, ColumnName As String
Set the Options for your Variant-Array.
- ColumnIndex is zero-based (first column has the index 0)
- Cluster: Include this column in the cluster search (default = true)
- DupeSearch: Include this column in the duplicate search (default = true)
- Weight: Weight for this column relative to other weights. Values 0 - 1 are weighted, Values > 1 require identical value (no fuzziness). (default = 1)
- CompareNull: NULL values in this column are compared (default = false*)
- ColumnName: Name of this column. Currently used for CSV-Export headers.
Call this prior to DupeSearch() and FuzzyMerge()
*) Important note: Set CompareNull to 'true' for columns that are filled with data in (nearly) all rows (e.g. Last Name, Street, ZIP, City)
and set CompareNull to 'false' for other columns which may contain NULL values in many rows (e.g. First Name, Phone Number, ...)
DupeSearch()
Search for duplicate values in a variant-array (record data)
Return Value: Variant-Array or Empty Variant (vt=VA_EMPTY)
Parameters: varArray AS Variant, IDColumnIndex AS Integer, Threshold As Double, ClusterThreshold As Double, Options As String, ReturnResults AS Enum, ShowProgress As Boolean
- varArray: 2 dim Variant-Array (e.g. returned by a ADODB.Recordset.GetRows() call)
- IDColumnIndex: Zero-based ID of the Identity Column (Column with destinct values, not required), -1 if none.
- Threshold: Threshold option for Duplicate Search (default = 90.0)
- ClusterThreshold: Adjusts the Cluster-Size (default = 0.666667)
- Options: Reserved, not currently used. (default = "")
- ReturnResults: DRR_DUPES|DRR_ALL - Return Dupes only, Return all records (default=DRR_DUPES)
- ShowProgress: Show a Progress dialog. If set to false, the control fires progress events so that you can implement your custom dialog.
(default = true)
The returned Variant has an additional column containing GUIDs for each Record.
These GUIDs are identical for similar records.
FuzzyMerge()
Merge two arrays (record data) with fuzzyness
Return Value: Variant-Array or Empty Variant (vt=VA_EMPTY)
Parameters: varArray AS Variant, varImport AS Variant, IDColumnIndex AS Integer, Threshold As Double, ClusterThreshold As Double, Options As String, ReturnResults As Enum, ShowProgress As Boolean, LogFilenameImported As String, LogFilenameRejected As String
- varArray: 2 dim Variant-Array (e.g. returned by a ADODB.Recordset.GetRows() call)
- varImport, data to merge: 2 dim Variant-Array (e.g. returned by a ADODB.Recordset.GetRows() call)
- IDColumnIndex: Zero-based ID of the Identity Column (Column with destinct values, not required), -1 if none.
- Threshold: Threshold option for Duplicate Search (default = 90.0)
- ClusterThreshold: Adjusts the Cluster-Size (default = 0.666667)
- Options: Reserved, not currently used. (default = "")
- ReturnResults: MRR_IMPORT|MRR_ALL|MRR_DUPES|MRR_CLEANMERGED - return new records only | return all records | return Dupes only | return the merged result without any dupes (default=MRR_IMPORT)
- ShowProgress: Show a Progress dialog. If set to false, the control fires progress events so that you can implement your custom dialog.
(default = true)
- LogFilenameImported: Full path to a log file. CSV-file containing all new records (default = "")
- LogFilenameRejected: Full path to a log file. CSV-file containing all rejected records (dupes), (default = "")
The returned Variant has up to 2 additional columns (depending on ReturnResults)
- GUIDs for each Record
- Source of record (0=varArray, 1=varImport)
These GUIDs are identical for similar records.
The log file exports use the settings from the properties CSVDelimiter, CSVTextInQuotes, CSVColumnHeaders
FuzzyMatch()
Matches two arrays (record data) with fuzzyness,
positive/negative match with external list
Return Value: Variant-Array or Empty Variant (vt=VA_EMPTY)
Parameters: varArray AS Variant, varMatch AS Variant, IDColumnIndex AS Integer, Threshold As Double, ClusterThreshold As Double, Options As String, ReturnResults As Enum, ShowProgress As Boolean, LogFilenamePositive As String, LogFilenameNegative As String
- varArray: 2 dim Variant-Array (e.g. returned by a ADODB.Recordset.GetRows() call)
- varMatch, data to match: 2 dim Variant-Array (e.g. returned by a ADODB.Recordset.GetRows() call)
- IDColumnIndex: Zero-based ID of the Identity Column (Column with destinct values, not required), -1 if none.
- Threshold: Threshold option for Duplicate Search (default = 90.0)
- ClusterThreshold: Adjusts the Cluster-Size (default = 0.666667)
- Options: Reserved, not currently used. (default = "")
- ReturnResults: MAR_ALL|MAR_POSITIVE|MAR_NEGATIVE - return all records | return positive matches | return negative matches (default=MAR_ALL)
- ShowProgress: Show a Progress dialog. If set to false, the control fires progress events so that you can implement your custom dialog.
(default = true)
- LogFilenamePositive: Full path to a log file. CSV-file containing all matching records from the varMatch-Array (default = "")
- LogFilenameNegative: Full path to a log file. CSV-file containing all negative/not matching records from the varMatch-Array (default = "")
The returned Variant has 1 additional column containing 0=No-Match, or Index (1 based) from varMatch where a match was found
The log file exports use the settings from the properties CSVDelimiter, CSVTextInQuotes, CSVColumnHeaders
DeleteDupes()
Delete duplicate values from a variant array (result from another method) depending on a DupeID (GUID)
Return Value: Variant-Array or Empty Variant (vt=VA_EMPTY)
Parameters: varArray AS Variant, FIDColumnIndex AS Integer, MaxColumnIndex As Integer, DeleteOrder As Enum, ReturnResults As Enum, ShowProgress As Boolean, LogFilenameDeleted As String
- varArray: 2 dim Variant-Array (e.g. returned by a ADODB.Recordset.GetRows() call)
- FIDColumnIndex: Zero-based ID of the FuzzyDupes-ID (GUID) Column (default = -1), if not given, this column will be detected automatically.
- MaxColumnIndex: only return MaxColumnIndex number of columns. Used to cut the DupesID (GUID) from the result (default = -1)
- DeleteOrder: DO_KEEP_FIRST|DO_KEEP_LAST - remove all dupes except the first / last (default=DO_KEEP_FIRST)
- ReturnResults: DER_CLEAN|DER_DELETED - cleaned result / deleted records
- ShowProgress: Show a Progress dialog. If set to false, the control fires progress events so that you can implement your custom dialog.
(default = true)
- LogFilenameDeleted: Full path to a log file. CSV-file containing all removed records (default = "")
Use this function, to get a clean (dupe free) search result:
- Search for dupes with DupeSearch() and ReturnResults=DRR_ALL
- Call this function to remove the dupes from the result
The log file exports use the settings from the properties CSVDelimiter, CSVTextInQuotes, CSVColumnHeaders
VariantArray2CSVFile()
Export a results from a search to a CSV text file
Return Value: BOOL
Parameters: varArray AS Variant, Filename AS String
Export a Variant-Array to a CSV file.
Returns true when successful.
- Uses the settings from the properties CSVDelimiter, CSVTextInQuotes, CSVColumnHeaders
- Shows the column headers set with SetColumnOption()
ExactMatchByColumn()
Match two arrays exactly by a key-column
Return Value: Variant-Array or Empty Variant (vt=VA_EMPTY)
Parameters: varArray AS Variant, varMatch AS Variant, MatchColumnIndex AS Integer, ReturnResults As Enum, ShowProgress As Boolean
- varArray: 2 dim Variant-Array (e.g. returned by a ADODB.Recordset.GetRows() call)
- varMatch, data to match: 2 dim Variant-Array (e.g. returned by a ADODB.Recordset.GetRows() call)
- MatchColumnIndex: Zero-based ID of the Key Column (Column with destinct values, required).
- ReturnResults: MAR_ALL|MAR_POSITIVE|MAR_NEGATIVE - return all records | return positive matches | return negative matches (default=MAR_NEGATIVE)
- ShowProgress: Show a Progress dialog. If set to false, the control fires progress events so that you can implement your custom dialog.
(default = false)
This function can be used to delete a list from the record data.
The delete list is e.g. the result from DeleteDupes() with parameter DER_DELETED
CanceledByUser()
Returns true, when a long running process was canceled.
Return Value: BOOL
Parameters: none
Sample Source Code
Pseudo Code (VB6 like)
--- End ---
Back to Kroll-Software Website