360Science for Spark - DedupeHive

The DedupeHive application demonstrates how to use the matchIT Hub for Spark ‘Row’ classes to work with a Hive data source, Rows, and Spark Datasets and RDDs.

[Image: DedupeHive.png]

Configuration

The command-line argument to DedupeHive is the name of a configuration file. This is an XML file in the following format:

<?xml version="1.0" encoding="utf-8" ?>
<config>
  <dedupeHive>
    <warehouseLocation>/user/hive/warehouse</warehouseLocation>
    <!-- Define one input for single table Matching -->
    <input>
      <table>input</table>
    </input>

    <!-- Define two inputs for Overlap Matching
    <input>
    </input> -->

    <!-- Output database and table name. -->
    <output>
      <table>matchingPairsSpark</table>
    </output>
    <delimiter>\t</delimiter>

    <licenceFile>./activation.txt</licenceFile>
    <logLevel>error</logLevel>
    <groupingAlgorithm>hub</groupingAlgorithm>
    <idField>0</idField>
    <maxIterations>4</maxIterations>
  </dedupeHive>

  <hub>
    <data>
      <input table="0" columns="|UniqueRef|FullName|Company|Address1|Address2|City|State|Zip" />
      <options>...</options>
    </data>
    <matching>
      <outputs>...</outputs>
    </matching>
    <threads>0</threads>
    <advanced>
      <nationality>USA</nationality>
    </advanced>
  </hub>
</config>

The <dedupeHive> section is specific to this application.

warehouseLocation: Location of the Hive warehouse.
input: Details of an input database table.
table: The name of a database table.
output: Details of an output database table.
delimiter: The delimiter used when converting records to delimited strings so that they can be passed to the underlying matching engine.
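
As an illustration of how these settings fit together, the following is a minimal sketch in Python (standard library only) that reads the <dedupeHive> section and joins a record with the configured delimiter. It is not part of the product: the function name read_dedupe_hive_settings and the sample record are hypothetical, and it assumes that a literal \t in the file stands for a tab character. The single versus overlap distinction follows the comments in the configuration listing (one input or two).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the configuration format shown above (raw string so that
# the backslash in <delimiter> is kept exactly as it appears in the file).
CONFIG_XML = r"""<config>
  <dedupeHive>
    <warehouseLocation>/user/hive/warehouse</warehouseLocation>
    <input>
      <table>input</table>
    </input>
    <output>
      <table>matchingPairsSpark</table>
    </output>
    <delimiter>\t</delimiter>
  </dedupeHive>
</config>"""


def read_dedupe_hive_settings(xml_text):
    """Return the <dedupeHive> settings described above as a plain dict."""
    section = ET.fromstring(xml_text).find("dedupeHive")
    if section is None:
        raise ValueError("missing <dedupeHive> section")
    inputs = [el.findtext("table") for el in section.findall("input")]
    raw_delimiter = section.findtext("delimiter", default="\\t")
    return {
        "warehouseLocation": section.findtext("warehouseLocation"),
        "inputs": inputs,
        "output": section.findtext("output/table"),
        # Assumption: a literal backslash-t in the file means a tab.
        "delimiter": "\t" if raw_delimiter == "\\t" else raw_delimiter,
        # One input = single-table matching, two inputs = overlap matching.
        "mode": "overlap" if len(inputs) == 2 else "single",
    }


settings = read_dedupe_hive_settings(CONFIG_XML)

# Hypothetical record, converted to a delimited string.
record = ["1", "Jane Doe", "Acme Ltd"]
line = settings["delimiter"].join(record)
```

The application itself may parse and apply these values differently; the sketch only shows how the documented elements relate to one another.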

See DedupeConfiguration for a description of the configuration options: licenceFile, logLevel, groupingAlgorithm, idField, and maxIterations.

The <hub> section configures the underlying matching engine. Refer to the matchIT Hub documentation for details. The <hub> section must contain the following sub-sections: data, matching, threads, advanced.
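
Because all four sub-sections are required, it can be useful to check a configuration file before submitting a job. The following is a minimal sketch of such a check in Python (standard library only). The helper name missing_hub_sections and the two inline sample documents are hypothetical; the check only confirms that the elements are present and does not validate their contents against the matchIT Hub documentation.

```python
import xml.etree.ElementTree as ET

# Sub-sections that the <hub> section must contain, as listed above.
REQUIRED_HUB_SECTIONS = ("data", "matching", "threads", "advanced")


def missing_hub_sections(xml_text):
    """Return the required <hub> sub-sections that are absent."""
    hub = ET.fromstring(xml_text).find("hub")
    if hub is None:
        # No <hub> section at all: everything is missing.
        return list(REQUIRED_HUB_SECTIONS)
    return [name for name in REQUIRED_HUB_SECTIONS if hub.find(name) is None]


# Hypothetical, heavily trimmed configurations for illustration.
complete = (
    "<config><hub><data/><matching/><threads>0</threads>"
    "<advanced/></hub></config>"
)
incomplete = "<config><hub><data/><threads>0</threads></hub></config>"

print(missing_hub_sections(complete))
print(missing_hub_sections(incomplete))
```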

Running the sample

The DedupeHive sample can’t be run out of the box like the DedupeTextFile sample, because it requires a Hive database source. Nevertheless, a sampleconfig.xml and a run.sh are provided to demonstrate how to set it up.
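
One way to adapt a configuration to your own Hive environment is to rewrite the warehouse location and table names programmatically. The following is a minimal sketch in Python (standard library only), assuming the file follows the format shown in the Configuration section. The function name point_config_at_hive and the warehouse path and table names passed to it are hypothetical placeholders, and the contents of the supplied sampleconfig.xml are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for a configuration in the format shown earlier.
SAMPLE_XML = r"""<config>
  <dedupeHive>
    <warehouseLocation>/user/hive/warehouse</warehouseLocation>
    <input>
      <table>input</table>
    </input>
    <output>
      <table>matchingPairsSpark</table>
    </output>
    <delimiter>\t</delimiter>
  </dedupeHive>
</config>"""


def point_config_at_hive(xml_text, warehouse, input_table, output_table):
    """Return a copy of the config with Hive location and tables replaced.

    Note: ElementTree's default parser drops XML comments, so comments in
    the original file do not survive this round trip.
    """
    root = ET.fromstring(xml_text)
    section = root.find("dedupeHive")
    section.find("warehouseLocation").text = warehouse
    section.find("input/table").text = input_table
    section.find("output/table").text = output_table
    return ET.tostring(root, encoding="unicode")


# Hypothetical values; substitute those of your own Hive installation.
updated = point_config_at_hive(
    SAMPLE_XML, "/apps/hive/warehouse", "customers", "customer_matches"
)
print(updated)
```

The resulting text can be saved to a new file and passed to DedupeHive as its configuration file argument.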
