mHUB - 360Science for Spark - DedupeJdbc


The DedupeJdbc application demonstrates using the 360Science for Spark ‘Row’ classes to work with SQL tables, Rows, and Spark Datasets and RDDs.



The command-line argument to DedupeJdbc is the name of a configuration file. This is an XML file in the following format:

<?xml version="1.0" encoding="utf-8" ?>
<config>
    <dedupeJdbc>

        <!-- Define one input for single table Matching -->
        <input connectionString="..." table="..."
               partitionColumn="..." lowerBound="..." upperBound="..." numPartitions="..." />

        <!-- Define two inputs for Overlap Matching
        <input connectionString="..." table="...">
        </input> -->

        <!-- Output database and table name. -->
        <output connectionString="..." table="..." />

        <delimiter>|</delimiter>
        <licenceFile>...</licenceFile>
    </dedupeJdbc>

    <hub>
        <data>
            <input table="0" columns="|UniqueRef|FullName|Company|Address1|Address2|City|State|Zip" />
        </data>
        <!-- matching, threads and advanced sections: refer to the mHUB documentation -->
    </hub>
</config>

(Placeholder values are shown as "..."; the exact element layout may differ — see the provided sampleconfig.xml for a complete example.)
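Each record is flattened into a single delimited string before it is passed to the underlying matching engine. The sketch below is a hypothetical helper, not code from the product; the leading-delimiter style simply follows the shape of the columns attribute above.

```java
import java.util.List;

// Hypothetical illustration of flattening a row's fields into one delimited
// string, in the style of the pipe-delimited columns attribute above.
public class DelimitedRecord {
    static String toDelimited(List<String> fields, String delimiter) {
        // Prefix with the delimiter so the result reads "|field1|field2|..."
        return delimiter + String.join(delimiter, fields);
    }

    public static void main(String[] args) {
        System.out.println(toDelimited(
            List.of("42", "Jane Doe", "Acme", "1 Main St", "", "Springfield", "IL", "62701"),
            "|"));
    }
}
```

Note that empty fields (such as a blank Address2) are preserved as empty slots between delimiters, so column positions stay aligned with the configured columns list.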

The <dedupeJdbc> section is specific to this application.

input Details of an input database table.
connectionString Jdbc connection string.
table The name of a database table.
partitionColumn, lowerBound, upperBound, numPartitions These must all be specified together to load the data into multiple partitions in parallel. Refer to the Spark documentation for details.
output Details of an output database table.
delimiter The delimiter used when converting records to delimited string in order to pass to the underlying matching engine.
licenceFile A file containing the product activation code.
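These four partitioning options mirror Spark's JDBC partitioning behaviour. As a rough illustration of what they control, the sketch below (a hypothetical, simplified helper — not code from the product or from Spark itself) splits a numeric column range into one WHERE clause per partition:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of how a JDBC source can split partitionColumn's range
// [lowerBound, upperBound) into numPartitions WHERE clauses, so each
// partition can be loaded by a separate parallel query.
public class PartitionClauses {
    static List<String> clauses(String col, long lower, long upper, int numPartitions) {
        long stride = (upper - lower) / numPartitions;
        List<String> out = new ArrayList<>();
        long current = lower;
        for (int i = 0; i < numPartitions; i++) {
            // First partition is open-ended below; last is open-ended above.
            String lowerClause = (i == 0) ? null : col + " >= " + current;
            current += stride;
            String upperClause = (i == numPartitions - 1) ? null : col + " < " + current;
            if (lowerClause == null) out.add(upperClause + " OR " + col + " IS NULL");
            else if (upperClause == null) out.add(lowerClause);
            else out.add(lowerClause + " AND " + upperClause);
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. partitionColumn=UniqueRef, lowerBound=0, upperBound=1000, numPartitions=4
        for (String c : clauses("UniqueRef", 0, 1000, 4)) System.out.println(c);
    }
}
```

Each generated clause drives one parallel SELECT against the source table, which is why all four options must be supplied together.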

The <hub> section configures the underlying matching engine. Refer to the mHUB documentation for details. The <hub> section must contain the following sub-sections: data, matching, threads, advanced.

Running the sample

The DedupeJdbc sample can’t be run out of the box like the DedupeTextFile sample because it requires a SQL database and the relevant Jdbc drivers. Nevertheless, a sampleconfig.xml is provided to demonstrate how to set this up to run.