mHUB - 360Science for Spark - DedupeHive


The DedupeHive application demonstrates how to use the 360Science for Spark ‘Row’ classes to work with a Hive data source, Rows, and Spark Datasets and RDDs.



The command-line argument to DedupeHive is the name of a configuration file. This is an XML file in the following format:

<?xml version="1.0" encoding="utf-8" ?>
<!-- Define one input for single table Matching -->

<!-- Define two inputs for Overlap Matching
</input> -->

<!-- Output database and table name. -->

<input table="0" columns="|UniqueRef|FullName|Company|Address1|Address2|City|State|Zip" />
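The columns attribute above lists the field names in a single string. A minimal sketch of reading it, assuming the names are separated by "|" and that the leading "|" does not introduce a column of its own (the source does not state either point; the function name is hypothetical):

```python
import xml.etree.ElementTree as ET
from typing import List

# The <input> line exactly as it appears in the configuration excerpt.
INPUT_LINE = (
    '<input table="0" '
    'columns="|UniqueRef|FullName|Company|Address1|Address2|City|State|Zip" />'
)


def column_names(input_xml: str, separator: str = "|") -> List[str]:
    """Return the column names from an <input> element's columns attribute.

    Assumption: names are separated by `separator`, and empty entries
    (such as the one produced by the leading separator) are ignored.
    """
    element = ET.fromstring(input_xml)
    raw = element.get("columns", "")
    return [name for name in raw.split(separator) if name]


names = column_names(INPUT_LINE)
print(names)
```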

The <dedupeHive> section is specific to this application.

warehouseLocation: Location of the Hive warehouse.
input: Details of an input database table.
table: The name of a database table.
output: Details of an output database table.
delimiter: The delimiter used when converting records to delimited strings to pass to the underlying matching engine.
licenceFile: A file containing the product activation code.
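
The settings listed above could be read along the following lines. This is only a sketch: the element names come from the list, but the nesting (a table element inside input and output), the configuration root element, and every value in the sample are assumptions made for illustration, not the shipped format. The to_delimited helper is likewise hypothetical and only illustrates what the delimiter setting is for.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical layout: nesting and values are assumed, not taken from the product.
SAMPLE_CONFIG = """
<configuration>
  <dedupeHive>
    <warehouseLocation>/user/hive/warehouse</warehouseLocation>
    <input><table>customers</table></input>
    <output><table>customers_matched</table></output>
    <delimiter>|</delimiter>
    <licenceFile>licence.txt</licenceFile>
  </dedupeHive>
</configuration>
"""


@dataclass
class DedupeHiveSettings:
    warehouse_location: str
    input_tables: List[str]  # one table for single-table matching, two for overlap
    output_table: str
    delimiter: str
    licence_file: str


def _required_text(section: ET.Element, path: str, strip: bool = True) -> str:
    """Return the text of a required child element, or raise ValueError."""
    node = section.find(path)
    if node is None or not node.text:
        raise ValueError(f"missing <{path}> in <dedupeHive>")
    return node.text.strip() if strip else node.text


def parse_dedupe_hive(xml_text: str) -> DedupeHiveSettings:
    """Extract the application-specific settings from a configuration string."""
    root = ET.fromstring(xml_text)
    section = root if root.tag == "dedupeHive" else root.find(".//dedupeHive")
    if section is None:
        raise ValueError("no <dedupeHive> section found")
    tables = [t.text.strip() for t in section.findall("input/table") if t.text]
    if not tables:
        raise ValueError("at least one <input> table is required")
    return DedupeHiveSettings(
        warehouse_location=_required_text(section, "warehouseLocation"),
        input_tables=tables,
        output_table=_required_text(section, "output/table"),
        # Not stripped, so a whitespace delimiter such as a tab survives.
        delimiter=_required_text(section, "delimiter", strip=False),
        licence_file=_required_text(section, "licenceFile"),
    )


def to_delimited(values: Sequence[object], delimiter: str) -> str:
    """Join one record's field values into a delimited string (None becomes empty)."""
    return delimiter.join("" if v is None else str(v) for v in values)


settings = parse_dedupe_hive(SAMPLE_CONFIG)
record = to_delimited(["42", "Ann Lee", None, "1 Main St"], settings.delimiter)
print(settings)
print(record)
```

Keeping the delimiter unstripped is a deliberate choice in this sketch; whether the real application trims it is not stated in the source.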

The <hub> section configures the underlying matching engine. Refer to the mHUB documentation for details. The <hub> section must contain the following sub-sections: data, matching, threads, advanced.
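
A quick pre-flight check for the four required sub-sections might look like the sketch below. It assumes that data, matching, threads and advanced appear as direct child elements of hub; the function name and the sample strings are hypothetical, and the contents of each sub-section are left to the mHUB documentation.

```python
import xml.etree.ElementTree as ET
from typing import List

# Sub-section names taken from the text above.
REQUIRED_HUB_SECTIONS = ("data", "matching", "threads", "advanced")


def missing_hub_sections(xml_text: str) -> List[str]:
    """Return the required <hub> sub-sections that are absent, in listed order.

    Assumption: each sub-section is a direct child element of <hub>.
    """
    root = ET.fromstring(xml_text)
    hub = root if root.tag == "hub" else root.find(".//hub")
    if hub is None:
        # Without a <hub> section, every required sub-section is missing.
        return list(REQUIRED_HUB_SECTIONS)
    present = {child.tag for child in hub}
    return [name for name in REQUIRED_HUB_SECTIONS if name not in present]


# Hypothetical fragments used only to exercise the check.
incomplete = "<configuration><hub><data/><matching/></hub></configuration>"
complete = "<hub><data/><matching/><threads/><advanced/></hub>"

print(missing_hub_sections(incomplete))
print(missing_hub_sections(complete))
```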

Running the sample

Unlike the DedupeTextFile sample, the DedupeHive sample can’t be run out of the box because it requires a Hive database source. Nevertheless, a sampleconfig.xml and are provided to demonstrate how to set this up to run.