360Science for Spark - Class Reference

The jar HubSpark.jar is the matchIT Hub for Spark package (com.matchIT.Hub.spark) and contains the following classes. You will only need this jar if you are building your own applications; the sample apps already include it.

matchIT for Spark API Classes

The following classes provide a high-level interface to the matchIT Hub functionality.

DedupeConfiguration Base class for configuration options for matchIT for Spark API and sample applications.
HubSchema Provides a schema for each stage of processing, based on configuration settings.
HubSpark Base class for HubSparkDataFrame and HubSparkRDD.
HubSparkDataFrame High-level functions for deduping data held in DataFrames.
HubSparkRDD High-level functions for deduping data held in RDDs.
HubStats Used to collect statistics across the various stages and partitions of a job.

DedupeConfiguration

Base class for parsing XML configuration options for matchIT for Spark API and sample applications.

The constructor is passed the name of the tag, within <config>, in which to find the configuration options. The default is "dedupeSpark", i.e. <config><dedupeSpark>...</dedupeSpark></config>.

DedupeConfiguration(String appTag)
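
For example, to read the configuration options from the default <dedupeSpark> tag (how the XML file itself is located and loaded is application-specific):

DedupeConfiguration config = new DedupeConfiguration("dedupeSpark");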

The XML settings are in the following format:

<?xml version="1.0" encoding="utf-8" ?>
<config>
  <dedupeSpark>
    <licenceFile>./activation.txt</licenceFile>
    <delimiter>\t</delimiter>
    <logLevel>error</logLevel>
    <warehouseLocation>/user/hive/warehouse</warehouseLocation>
    <groupingAlgorithm>hub</groupingAlgorithm>
    <schema>|id|full_name|last_name|addr1|addr2|city|state|zip</schema>
    <idField>0</idField>
    <maxIterations>4</maxIterations>
  </dedupeSpark>
</config>

licenceFile A file containing the product activation code.
delimiter The delimiter used in the input file and in the delimited Strings in the RDDs.
logLevel Minimum severity level of errors to log. See org.apache.log4j.Level.
  • Off
  • Fatal
  • Error
  • Warn
  • Info
  • Debug
  • Trace
warehouseLocation Location of the Hive warehouse. Only required if using Hive.
groupingAlgorithm The grouping algorithm to use to group together matching pairs.
  • Hub - matchIT Hub's Grouping algorithm. Requires coalescing the pairs to a single node, so has limited scalability;
  • graphX - Apache Spark's built-in GraphX connected components algorithm. Limited scalability; only supported for DataFrames;
  • Kwartile - Kwartile's Map/Reduce implementation of connected components. Scales well to tens of billions of records; only supported for RDDs.
schema Field names to use in a DataFrame. This setting is optional and if present overrides the field names in the header row. This is useful if you need to rename a unique ref field to 'id' to use graphX for grouping.
idField Field number to use as the unique ref field in an RDD. This setting is required when using Kwartile for grouping, for joining the pairs of unique refs back to the source data.
maxIterations The maximum number of iterations to allow when using the Kwartile grouping algorithm.

HubSpark

Base class for high-level functions for deduping data held in DataFrames (Dataset<Row>) or RDDs (RDD<String> - where String is a delimited record).

The class’ public methods are:

void init(String appName, DedupeConfiguration config) Creates a SparkSession based on the supplied config details and application name. Creates instances of HubSchema and HubStats.
void close() Prints stats and closes the SparkSession.
SparkSession getSparkSession() Returns the SparkSession.
JavaSparkContext getSparkContext() Returns the JavaSparkContext.
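
A minimal lifecycle sketch, assuming these methods are called on a concrete subclass such as HubSparkDataFrame (the no-argument constructor and variable names below are illustrative assumptions):

DedupeConfiguration config = new DedupeConfiguration("dedupeSpark");
HubSparkDataFrame hub = new HubSparkDataFrame();   // assumed no-arg constructor

hub.init("DedupeExample", config);   // creates the SparkSession, HubSchema and HubStats
SparkSession spark = hub.getSparkSession();

// ... perform matching and grouping here ...

hub.close();                         // prints stats and closes the SparkSession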

HubSparkDataFrame

The class’ public methods are:

Dataset<Row> matching(Dataset<Row> mainInput) Perform internal matching on a dataset and return a dataset of matching pairs.
Dataset<Row> matching(Dataset<Row> mainInput, Dataset<Row> overlapInput) Perform overlap matching on two datasets and return a dataset of matching pairs.
Dataset<Row> grouping(Dataset<Row> pairs) Perform Grouping (using matchIT Hub's Grouping mode) on a dataset of matching pairs.
Dataset<Row> groupingGraph(Dataset<Row> input, Dataset<Row> pairs) Perform Grouping (using graphX's connected components algorithm) on a dataset of matching pairs. The input DataFrame must have a unique ref column named 'id', and the pairs DataFrame must have unique ref columns named 'src' and 'dst' (HubSchema takes care of naming the matching pairs columns).
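
A minimal internal dedupe sketch, assuming hub is an initialised HubSparkDataFrame and the input path and reader options are illustrative:

Dataset<Row> input = hub.getSparkSession().read()
        .option("header", "true")
        .option("sep", "\t")
        .csv("/path/to/input.txt");

Dataset<Row> pairs = hub.matching(input);    // matching pairs
Dataset<Row> groups = hub.grouping(pairs);   // grouped using Hub's Grouping mode

// Alternatively, with groupingAlgorithm set to graphX and a unique ref column named 'id':
// Dataset<Row> groups = hub.groupingGraph(input, pairs);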

HubSparkRDD

The class’ public methods are:

JavaRDD<String> matching(JavaRDD<String> mainInput) Perform internal matching on an RDD and return an RDD of matching pairs.
JavaRDD<String> matching(JavaRDD<String> mainInput, JavaRDD<String> overlapInput) Perform overlap matching on two RDDs and return an RDD of matching pairs.
JavaRDD<String> grouping(JavaRDD<String> pairs) Perform Grouping (using matchIT Hub's Grouping mode) on an RDD of matching pairs.
JavaRDD<String> groupingKwartile(JavaRDD<String> input, JavaRDD<String> pairs) Perform Grouping (using Kwartile's connected components algorithm) on an RDD of matching pairs. The unique ref column number in the input RDD is specified by the idField configuration setting.
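
A minimal internal dedupe sketch on RDDs of delimited records, assuming hub is an initialised HubSparkRDD and the file paths are illustrative:

JavaRDD<String> input = hub.getSparkContext().textFile("/path/to/input.txt");

JavaRDD<String> pairs = hub.matching(input);    // matching pairs
JavaRDD<String> groups = hub.grouping(pairs);   // grouped using Hub's Grouping mode

// Alternatively, with groupingAlgorithm set to Kwartile and idField configured:
// JavaRDD<String> groups = hub.groupingKwartile(input, pairs);

groups.saveAsTextFile("/path/to/output");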

HubSchema

Provides a schema for each stage of processing, based on configuration settings. This helper class populates org.apache.spark.sql.types.StructType schema structures for the inputs and outputs of each data processing class.

The constructor is passed an activation code and the same Hub settings XML used by the data processing classes.

HubSchema(String activationCode, String hubSettings)

The class’ public methods are:

StructType getInputSchema(String columns) Generates an input schema given a delimited list of input columns (columns must start with the delimiter used).
StructType getKeyGenerationOutputSchema(int table) Generates Key Generation output schema for the given table.
StructType getPairMatchingOutputSchema() Generates the Pair Matching output schema (i.e. matching pairs).
StructType getGroupingOutputSchema() Generates the Grouping output schema.
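
A minimal sketch of generating the schemas, assuming tab-delimited columns matching the sample configuration above (the activationCode and hubSettings variables are illustrative):

HubSchema hubSchema = new HubSchema(activationCode, hubSettings);

// The column list must start with the delimiter in use (tab in this example)
StructType inputSchema = hubSchema.getInputSchema("\tid\tfull_name\tlast_name\taddr1\taddr2\tcity\tstate\tzip");
StructType keyedSchema = hubSchema.getKeyGenerationOutputSchema(0);   // table 0 = main input
StructType pairsSchema = hubSchema.getPairMatchingOutputSchema();
StructType groupsSchema = hubSchema.getGroupingOutputSchema();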

HubStats

Used to collect statistics across the various stages and partitions of a job. Create one instance of this class and pass it to all the data processing tasks. When processing is finished, call HubStats::print() to display the total statistics.

The constructor is passed the JavaSparkContext and the number of exact and fuzzy keys in the configuration.

HubStats(JavaSparkContext context, int numExactKeys, int numFuzzyKeys)

The class’ public methods are:

void add(String statsXml) Adds the figures from the given Hub statistics XML to the accumulators.
void addClusters(long newClusters, long newLargeClusters) Adds the given values to the accumulators for the number of clusters and large clusters.
void print() Prints the stats to System.out.
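
A minimal sketch, assuming context is an existing JavaSparkContext and a configuration with, say, two exact keys and one fuzzy key (the counts are illustrative):

HubStats stats = new HubStats(context, 2, 1);

// ... pass 'stats' to the data processing classes and run the job ...

stats.print();   // prints the accumulated totals to System.out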

Data Processing Classes

The following low-level classes implement Spark functions used in transformations and come in two versions: a “String” version for working with JavaRDD<String>, where each String is a delimited record, and a “Row” version for working with JavaRDD<Row>/Dataset<Row>.

GroupingRow Data processing class that performs Grouping of matching pairs.
GroupingString Data processing class that performs Grouping of matching pairs.
GroupMatchingString Data processing class that performs GroupMatching to post-process records already grouped.
KeyedToKeyValuesRow Data processing class that converts Keyed records into {key, value} pairs.
KeyedToKeyValuesString Data processing class that converts Keyed records into {key, value} pairs.
KeyGenerationRow Data processing class that performs Key Generation.
KeyGenerationString Data processing class that performs Key Generation.
PairMatchingRow Data processing class that performs Pair Matching.
PairMatchingString Data processing class that performs Pair Matching.

Grouping

Groups matching pairs.

The constructors take: activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.

GroupingString(String activationCode,
               String hubSettings,
               String delimiter,
               HubStats stats)

GroupingRow(String activationCode,
            String hubSettings,
            String delimiter,
            HubStats stats)

GroupingRow

Implements FlatMapFunction<Iterator<Row>, Row> for use with JavaRDD::mapPartitions().

JavaRDD<Row> groups = allPairs.mapPartitions(
        new GroupingRow(activationCode,
                        hubSettings,
                        delimiter,
                        stats));

GroupingString

Implements FlatMapFunction<Iterator<String>, String> for use with JavaRDD::mapPartitions().

JavaRDD<String> groups = allPairs.mapPartitions(
        new GroupingString(activationCode,
                           hubSettings,
                           delimiter,
                           stats));

GroupMatching

Re-processes groups of matching records. This is for use when matching pairs have been grouped by some means other than Hub's Grouping mode - for example, Kwartile's Map/Reduce implementation of connected components. The point of reprocessing with Hub's GroupMatching mode is to apply the Bridging Prevention and Master Record Identification functionality, and to apply scores etc.

The constructors take: activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.

GroupMatchingString(String activationCode,
                    String hubSettings,
                    String delimiter,
                    HubStats stats)

GroupMatchingString

Implements FlatMapFunction<Iterator<Tuple2<String, Iterable<String>>>, String> for use with JavaPairRDD::mapPartitions().

JavaRDD<String> groups = grouped.mapPartitions(
        new GroupMatchingString(activationCode,
                                hubSettings,
                                delimiter,
                                stats));

KeyedToKeyValues

Applied to the output of KeyGeneration, generates {key, value} pairs for each key. The output of KeyGeneration has all the key values appended in a new field. This task converts that input into {key, value} pairs, turning the JavaRDD into a JavaPairRDD.

KeyedToKeyValuesRow

The constructor takes the StructType schema used in the Row records.

KeyedToKeyValuesRow(StructType schema)

Implements PairFlatMapFunction<Row, String, Row> for use with JavaRDD::flatMapToPair().

JavaPairRDD<String, Row> keys = keyed.javaRDD().flatMapToPair(
        new KeyedToKeyValuesRow(keyGenOutputSchema));

KeyedToKeyValuesString

The constructor takes the delimiter used in the String records.

KeyedToKeyValuesString(String delimiter)

Implements PairFlatMapFunction<String, String, String> for use with JavaRDD::flatMapToPair().

JavaPairRDD<String, String> keys = keyed.flatMapToPair(
        new KeyedToKeyValuesString(delimiter));

KeyGeneration

Appends all key values to each record using Hub in a new “Key Generation” mode.

The constructors take: a table number (0, 1, or 2), activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.

KeyGenerationString(int table,
                    String activationCode,
                    String hubSettings,
                    String delimiter,
                    HubStats stats)

KeyGenerationRow(int table,
                 String activationCode,
                 String hubSettings,
                 String delimiter,
                 HubStats stats)

KeyGenerationRow

Implements MapPartitionsFunction<Row, Row> for use with Dataset::mapPartitions().

Dataset<Row> keyed = rowsDF.mapPartitions(
        new KeyGenerationRow(overlap ? 1 : 0,
                             activationCode,
                             hubSettings,
                             delimiter,
                             stats),
        encoder);

Implements FlatMapFunction<Iterator<Row>, Row> for use with JavaRDD::mapPartitions().

JavaRDD<Row> keyed = rowsDF.javaRDD().mapPartitions(
        new KeyGenerationRow(overlap ? 1 : 0,
                             activationCode,
                             hubSettings,
                             delimiter,
                             stats));

KeyGenerationString

Implements FlatMapFunction<Iterator<String>, String> for use with JavaRDD::mapPartitions().

Example usage:

JavaRDD<String> keyed = rows.mapPartitions(
        new KeyGenerationString(overlap ? 1 : 0,
                                activationCode,
                                hubSettings,
                                delimiter,
                                stats));

PairMatching

Applied to the output of KeyedToKeyValues grouped by key, compares every record in each group with every other record in the group (whilst avoiding duplicate comparisons). Sends pairs of records to Hub in Pair Matching mode. Outputs matching pairs.
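
In the examples below, clusters is the grouped-by-key output of KeyedToKeyValues. A minimal sketch for the String version, assuming keys is the JavaPairRDD<String, String> shown earlier:

// Group the {key, value} pairs by key, ready for within-group comparison
JavaPairRDD<String, Iterable<String>> clusters = keys.groupByKey();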

The constructors take: a flag to indicate if overlap matching, activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.

PairMatchingString(boolean overlap,
                   String activationCode,
                   String hubSettings,
                   String delimiter,
                   HubStats stats)

PairMatchingRow(boolean overlap,
                String activationCode,
                String hubSettings,
                String delimiter,
                HubStats stats)

PairMatchingRow

Implements FlatMapFunction<Iterator<Tuple2<String, Iterable<Row>>>, Row> for use with JavaPairRDD<String, Iterable<Row>>::mapPartitions().

JavaRDD<Row> pairs = clusters.mapPartitions(
        new PairMatchingRow(overlap,
                            activationCode,
                            hubSettings,
                            delimiter,
                            stats));

PairMatchingString

Implements FlatMapFunction<Iterator<Tuple2<String, Iterable<String>>>, String> for use with JavaPairRDD<String, Iterable<String>>::mapPartitions().

JavaRDD<String> pairs = clusters.mapPartitions(
        new PairMatchingString(overlap,
                               activationCode,
                               hubSettings,
                               delimiter,
                               stats));
