mHUB - Batch Examples

 

C# & Java batch examples

The C# & Java batch clients are similar to the HubTest examples for the Hub API.

Usage

BatchClient /testconfig=<testConfigFile>
            /settings=<settingsFile>
            /license=<licenseFile>
            [/stats=<statsFile>]
            [switches]

Where:
  <testConfigFile>  The name of a BatchClient XML configuration file
                    (see below for details).
  <settingsFile>    The name of a mHub XML settings file
                    (see the Configuration Guide for details).
  <licenseFile>     The name of the file containing the activation code.
  <statsFile>       The name of a file to write statistics to (XML format).

Switches:
  /help or /?       Display usage.

e.g.
BatchClient /testconfig=config.xml /settings=settings.xml /stats=stats.xml /license=activation.txt

Configuration

This configuration file is specific to BatchClient and gives database connection details and other options. There are four sections: service, log, input, & output.

               
<?xml version="1.0" encoding="utf-8" ?>
<config>

  <service>...</service>
  <log>...</log>
  <input>...</input>
  <output>...</output>

</config>

The <service> section configures general settings for the service. For example:

               
  <service>
    <host>http://localhost:8080</host>
    <engine destroy="true">0</engine>
    <pollInterval>2000</pollInterval>
    <uploadBlockRecordLimit>100000</uploadBlockRecordLimit>
    <compressionMinSize>1048576</compressionMinSize>
    <downloadBlockRecordLimit>10000</downloadBlockRecordLimit>
  </service>
  • <host> - configures the protocol, host and port of the matchIT Hub Service. The default is http://localhost:8080.
  • <engine> - configures which engine instance to use and whether or not to destroy it when finished. Setting the engine instance id to 0 creates a new engine instance. You can specify an engine name instead of an id; if no engine instance exists with the given name, a new one will be created.
  • <pollInterval> - when waiting for processing to finish or results to be ready, specifies how many milliseconds to sleep between calls querying the state of the service. The default is 10,000ms (i.e. 10 seconds).
  • <uploadBlockRecordLimit> - specifies the number of records for the ProxyEngine to collect - via the AddData() method - before uploading to the service. Specify 0 to wait and upload all the records in one go. The default is 100,000.
  • <compressionMinSize> - when sending blocks of records to the service, blocks over the compression minimum size (bytes) will be compressed using gzip. Specify 0 (default) to disable compression.
  • <downloadBlockRecordLimit> - specifies the maximum number of records to be returned at a time by the ProxyEngine OpenDownloadStream() method - which calls the REST API results method. Specify 0 (default) to download all the results in one go.
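The defaults listed above can be made concrete with a short sketch that reads a <service> section and falls back to the documented default whenever an element is missing. This is only an illustration: ServiceSettings and its fields are hypothetical names, not part of BatchClient, and the <engine> element is deliberately left out because its defaults are not stated here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical holder for <service> values; not a class shipped with BatchClient. */
final class ServiceSettings {
    // Initial values are the defaults documented in the list above.
    String host = "http://localhost:8080";
    int pollIntervalMs = 10_000;
    int uploadBlockRecordLimit = 100_000;
    int compressionMinSize = 0;        // 0 disables compression
    int downloadBlockRecordLimit = 0;  // 0 downloads all results in one go

    /** Parses a complete configuration document held in a string. */
    static ServiceSettings parse(String configXml) {
        ServiceSettings s = new ServiceSettings();
        try {
            NodeList services = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("service");
            if (services.getLength() == 0) {
                return s; // no <service> section: keep every default
            }
            Element service = (Element) services.item(0);
            s.host = text(service, "host", s.host);
            s.pollIntervalMs = number(service, "pollInterval", s.pollIntervalMs);
            s.uploadBlockRecordLimit = number(service, "uploadBlockRecordLimit", s.uploadBlockRecordLimit);
            s.compressionMinSize = number(service, "compressionMinSize", s.compressionMinSize);
            s.downloadBlockRecordLimit = number(service, "downloadBlockRecordLimit", s.downloadBlockRecordLimit);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not read configuration XML", e);
        }
        return s;
    }

    /** Returns the trimmed text of the first matching child element, or the fallback. */
    private static String text(Element parent, String tag, String fallback) {
        NodeList nodes = parent.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            return fallback;
        }
        String value = nodes.item(0).getTextContent().trim();
        return value.isEmpty() ? fallback : value;
    }

    private static int number(Element parent, String tag, int fallback) {
        return Integer.parseInt(text(parent, tag, Integer.toString(fallback)));
    }
}
```

With the example <service> section shown earlier, pollIntervalMs would be 2000 rather than the 10,000ms default, because the element is present in the file.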

The <log> section configures logging. See Logging for full details.

The <input> & <output> sections configure the data sources and the output target. The following outlines four different scenarios:

a. <connectionString> & <table>

Load data from a database and output to a database.
With this configuration BatchClient sends data to the Hub Service in batches and streams back the results.

The format of the connectionString depends on the database and the language (C#/Java).

The C# BatchClient supports MS SQL, with connection strings of the form:

<connectionString>Data Source=HOST; Initial Catalog=DATABASE; Integrated Security=SSPI;</connectionString>
<connectionString>Data Source=HOST; Initial Catalog=DATABASE; User Id=USERNAME;Password=PASSWORD;</connectionString>

The Java BatchClient supports MySQL, MS SQL, & Oracle (you will need to download the appropriate drivers), with connection strings of the form:

<connectionString>jdbc:mysql://HOST/DATABASE?user=USER&amp;password=PASSWORD</connectionString>
<connectionString>jdbc:sqlserver://HOST;DatabaseName=DATABASE;IntegratedSecurity=true;</connectionString>
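Because the configuration file is XML, a literal & inside a connection string has to be written as &amp; for a standard XML parser to accept the file. The sketch below, which assumes BatchClient reads its configuration with such a parser, builds a MySQL JDBC URL of the form shown above and escapes it for use as element text. ConnectionStringXml and its methods are hypothetical helper names, and the sketch does not URL-encode the user name or password.

```java
/** Hypothetical helper for writing a connection string into the XML configuration. */
final class ConnectionStringXml {

    public static void main(String[] args) {
        String url = mysqlUrl("HOST", "DATABASE", "USER", "PASSWORD");
        System.out.println(connectionStringElement(url));
    }

    /** Builds a MySQL JDBC URL in the form used by the examples in this article. */
    static String mysqlUrl(String host, String database, String user, String password) {
        return "jdbc:mysql://" + host + "/" + database
                + "?user=" + user + "&password=" + password;
    }

    /** Escapes the characters that are not allowed as-is in XML element text. */
    static String escapeXmlText(String value) {
        // '&' must be replaced first so the entities added below are not escaped again.
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    /** Wraps the escaped value in a <connectionString> element. */
    static String connectionStringElement(String url) {
        return "<connectionString>" + escapeXmlText(url) + "</connectionString>";
    }
}
```

An XML parser turns &amp; back into & when the file is read, so the JDBC driver still receives the original URL.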

The following example (with Java BatchClient connection strings) loads data from a MySQL database and writes results to an MS SQL database.

This configuration expects the Hub Service to be listening on localhost:8080 (the default).

               
<?xml version="1.0" encoding="utf-8" ?>
<config>

  <service>
    <host>http://localhost:8080</host>
    <!-- Set engine to 0 to create a new instance -->
    <engine destroy="true">0</engine>
  </service>

  <log>
    <filename>debuglog.txt</filename>
    <severity>Debug</severity>
  </log>

  <input>
    <!-- Define one data source for single table Matching -->
    <dataSource>
      <connectionString>jdbc:mysql://localhost/DATABASE?user=USER&amp;password=PASSWORD</connectionString>
      <table>TABLE</table>
      <columns>UniqueRef,Prefix,Forenames,Surname,Address1,Address2,Address3,Address4,Address5,Postcode</columns>
    </dataSource>

    <!-- Define two data sources for Overlap Matching -->
    <!--
    <dataSource>
    </dataSource>
    -->
  </input>

  <!-- Output database in which to create tables for MatchingPairs, GroupedMatchingPairs,
       MatchingGroups, DedupedData, DuplicateData.
       Only the output types enabled in the mHub configuration file will be output. -->
  <output>
    <connectionString>jdbc:sqlserver://localhost;DatabaseName=DATABASE;IntegratedSecurity=true;</connectionString>
  </output>

</config>

b. <filename> & <delimiter>

Load data from and write results to delimited files.
With this configuration BatchClient sends the file names to the Hub Service, which reads and writes the files directly. The Hub Service must have read/write access to the specified files. Any non-alphanumeric character can be used as the delimiter for the input file; '\t' is interpreted as tab. If no delimiter is specified, each record must begin with the delimiter used in that record (which can be different for each record). Each record in the output file begins with the delimiter used in that record.
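The delimiter rules above can be sketched as a small record splitter. It is an illustration of the rules as described, not the Hub Service's own parser: DelimitedRecord is a hypothetical name, and the sketch assumes single-character delimiters and no quoting or escaping of field values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical splitter illustrating the delimiter rules described above. */
final class DelimitedRecord {

    public static void main(String[] args) {
        // No configured delimiter: the record's first character is its delimiter.
        System.out.println(split("|1|Mr|John|Smith", null));
        // Configured delimiter "\t" (backslash + t) is interpreted as a tab.
        System.out.println(split("1\tMr\tJohn\tSmith", "\\t"));
    }

    /**
     * Splits one record into fields.
     *
     * @param configuredDelimiter the <delimiter> value, or null/empty when none is configured
     */
    static List<String> split(String record, String configuredDelimiter) {
        List<String> fields = new ArrayList<>();
        char delimiter;
        String body;
        if (configuredDelimiter == null || configuredDelimiter.isEmpty()) {
            if (record.isEmpty()) {
                return fields; // nothing to read the delimiter from
            }
            delimiter = record.charAt(0);
            body = record.substring(1);
        } else {
            delimiter = "\\t".equals(configuredDelimiter) ? '\t' : configuredDelimiter.charAt(0);
            body = record;
        }
        int start = 0;
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) == delimiter) {
                fields.add(body.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(body.substring(start)); // last field has no trailing delimiter
        return fields;
    }
}
```

Both calls in main yield the same four fields, which shows that a self-describing record and a record with a configured delimiter carry the same data.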

<?xml version="1.0" encoding="utf-8" ?>
<config>

  <input>
    <!-- Define one data source for single table Matching -->
    <dataSource>
      <filename>DRIVE:\PATH\FILENAME.txt</filename>
      <delimiter>\t</delimiter>
      <columns>UniqueRef,Prefix,Forenames,Surname,Address1,Address2,Address3,Address4,Address5,Postcode</columns>
    </dataSource>
  </input>

  <!-- Output file to receive MatchingPairs, GroupedMatchingPairs, MatchingGroups, DedupedData,
       DuplicateData.
       Only the output types enabled in the mHub configuration file will be output. -->
  <output>
    <filename>DRIVE:\PATH\FILENAME.txt</filename>
  </output>

</config>

 

c. Running multiple overlaps against a master table that is only loaded once

To run an overlap, specify two dataSources in the <input> section:

               
  <input>
    <dataSource>
      <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
      <connectionString>jdbc:mysql://localhost/DATABASE?user=USER&amp;password=PASSWORD</connectionString>
      <table>TABLE</table>
    </dataSource>

    <dataSource>
      <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
      <filename>DRIVE:\PATH\FILENAME.txt</filename>
      <delimiter>\t</delimiter>
    </dataSource>
  </input>

To just load data into table 1 in preparation for running overlaps, define the columns for table 2 but don't specify either filename & delimiter or connectionString & table:

               
  <input>
    <dataSource>
      <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
      <connectionString>jdbc:mysql://localhost/DATABASE?user=USER&amp;password=PASSWORD</connectionString>
      <table>TABLE</table>
    </dataSource>

    <dataSource>
      <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
    </dataSource>
  </input>

To run subsequent overlaps against a previously loaded table 1, define the columns for table 1 but don't specify either filename & delimiter or connectionString & table:

               
  <input>
    <dataSource>
      <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
    </dataSource>

    <dataSource>
      <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
      <filename>DRIVE:\PATH\FILENAME.txt</filename>
      <delimiter>\t</delimiter>
    </dataSource>
  </input>

When running the first overlap (or initial load), in the <service> section set:

               
    <engine destroy="false">0</engine>

The zero means it should create a new engine instance as usual, but the destroy="false" option tells it to leave that instance running with the table 1 data still loaded.

Run subsequent overlaps by specifying the engine number created in the first job, e.g.

               
    <engine destroy="false">1</engine>

 

d. <connectionString>, <table>, <filename> & <delimiter>

Load data from a database and output to a database via delimited files.
With this configuration BatchClient extracts the data from the data source tables to files and sends the file names to the Hub Service.

The Hub Service then writes results to the specified output file, and BatchClient loads the results into database tables.

In this scenario, it is safest to not specify a delimiter for the input file - in which case BatchClient will automatically select an appropriate delimiter for each record (i.e. one that does not appear in the field data). Automatically selected delimiters will be 0x01, 0x02 etc.
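The per-record delimiter selection described above can be sketched as follows. This is an assumption-laden illustration rather than BatchClient's actual code: AutoDelimiter is a hypothetical name, and skipping the line-break characters is an assumption made here so that each record stays on one line.

```java
import java.util.List;

/** Hypothetical sketch of choosing a delimiter that does not occur in a record's fields. */
final class AutoDelimiter {

    public static void main(String[] args) {
        String record = format(List.of("1", "John", "Smith"));
        // Print the delimiter's code point rather than the unprintable character itself.
        System.out.println("delimiter code: " + (int) record.charAt(0));
    }

    /** Returns the first control character from 0x01 upward that no field contains. */
    static char choose(List<String> fields) {
        for (char candidate = 0x01; candidate < 0x20; candidate++) {
            if (candidate == '\n' || candidate == '\r') {
                continue; // assumption: never use a line break as a delimiter
            }
            if (!appearsIn(candidate, fields)) {
                return candidate;
            }
        }
        throw new IllegalArgumentException("No unused control character available");
    }

    private static boolean appearsIn(char c, List<String> fields) {
        for (String field : fields) {
            if (field.indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    /** Writes the record so that it begins with the delimiter used in that record. */
    static String format(List<String> fields) {
        char delimiter = choose(fields);
        StringBuilder record = new StringBuilder();
        for (String field : fields) {
            record.append(delimiter).append(field);
        }
        return record.toString();
    }
}
```

Because the chosen delimiter is written at the start of each record, a reader can recover it without any <delimiter> setting.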

The following example (with C# BatchClient connection strings) loads data from and writes results to an MS SQL database.

<?xml version="1.0" encoding="utf-8" ?>
<config>

  <input>
    <!-- Define one data source for single table Matching -->
    <dataSource>
      <connectionString>Data Source=HOST; Initial Catalog=DATABASE; Integrated Security=SSPI;</connectionString>
      <table>TABLE</table>
      <filename>DRIVE:\PATH\FILENAME.txt</filename>
      <columns>UniqueRef,Prefix,Forenames,Surname,Address1,Address2,Address3,Address4,Address5,Postcode</columns>
    </dataSource>
  </input>

  <!-- Output database in which to create tables for MatchingPairs, GroupedMatchingPairs,
       MatchingGroups, DedupedData, DuplicateData.
       Only the output types enabled in the mHub configuration file will be output. -->
  <output>
    <connectionString>Data Source=HOST; Initial Catalog=DATABASE; Integrated Security=SSPI;</connectionString>
    <filename>DRIVE:\PATH\FILENAME.txt</filename>
  </output>

</config>

 

 

 

C# & Java proxy engine

The C# BatchClient uses matchIT.Hub.ProxyEngine and the Java BatchClient uses com.matchIT.Hub.ProxyEngine. ProxyEngine encapsulates the HTTP communication with the Hub Service in a class that has the same interface as the Hub API, with the following additions:

Host

  • public void setHost(string hostName)
    Call this to set the host and port of the matchIT Hub Service. The default is localhost:8080.

Engines

  • public int[] getEngines()
    Returns an array of existing engine instance numbers.
  • public void createEngine()
    Create a new engine instance.
  • public void createEngine(String name)
    Create a new named engine instance.
  • public int getEngineId()
    Return the id number of the engine instance currently in use (so you can reuse it later).
  • public String getEngineName()
    Return the name (if any) of the engine instance currently in use.
  • public void setEngineId(int engineId)
    Call this instead of createEngine() if you want to reuse an existing engine already instantiated in the service and you know the id.
  • public void setEngineByName(String name)
    Call this instead of createEngine() if you want to reuse an existing engine already instantiated in the service and you know the name.
  • public void deleteEngine(int engineId)
    Delete a specified engine instance.
  • public void deleteEngine()
    Deletes the engine instance currently in use, i.e. the one specified in a prior call to setEngineId() or created by a prior call to createEngine().

Logging

  • public void setLogging( string filename, string severity )
    Configures logging.

Configuration

  • public void setUploadBlockRecordLimit( int recordLimit )
    Set the maximum number of records to send per block.
  • public void setDownloadBlockRecordLimit(int recordLimit)
    Set the maximum number of records to download per block.
  • public void setCompressionMinSize(int minSize)
    Set the block size (in bytes) over which to use compression.

Data

  • public void addTransaction( int table, string data, int timeoutInMS )
    Applies an A(dd), U(pdate), or D(elete) transaction. See the matchIT Hub Service documentation for "POST tables/<0,1,2>/update" for details.
  • public void flush(int table)
    Both addData() and addTransaction() batch the records into blocks behind the scenes, so when you have finished making calls to either of these methods you need to call flush(). N.B. You must also call flush() between adding data and adding transactions.
  • public void setInputFile(int Table, string Delimiter, string Filename, string Encoding, bool NoMore, int timeoutInMS)
    Post a filename to the service to load data from.
  • public void setUpdateFile(int Table, string Delimiter, string Filename, string Encoding, bool NoMore, int timeoutInMS)
    Post a filename to the service to load transactions from.
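The reason flush() is needed can be seen in a minimal sketch of block batching. This is not ProxyEngine's implementation: RecordBlockBuffer is a hypothetical class, and the uploader callback stands in for the HTTP upload. It follows the rule described for the upload block record limit, where 0 means nothing is sent until the final flush.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical buffer that collects records into blocks before handing them to an uploader. */
final class RecordBlockBuffer {
    private final int recordLimit;                  // 0 = hold everything until flush()
    private final Consumer<List<String>> uploader;  // stand-in for the HTTP upload
    private final List<String> pending = new ArrayList<>();

    RecordBlockBuffer(int recordLimit, Consumer<List<String>> uploader) {
        if (recordLimit < 0) {
            throw new IllegalArgumentException("recordLimit must be 0 or greater");
        }
        this.recordLimit = recordLimit;
        this.uploader = uploader;
    }

    public static void main(String[] args) {
        RecordBlockBuffer buffer =
                new RecordBlockBuffer(2, block -> System.out.println("upload " + block));
        buffer.add("r1");
        buffer.add("r2"); // limit reached: first block is uploaded here
        buffer.add("r3"); // stays buffered
        buffer.flush();   // without this call, r3 would never be sent
    }

    /** Adds one record and uploads a block as soon as the limit is reached. */
    void add(String record) {
        pending.add(record);
        if (recordLimit > 0 && pending.size() >= recordLimit) {
            flush();
        }
    }

    /** Uploads whatever is still buffered, if anything. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        uploader.accept(new ArrayList<>(pending)); // copy so the block is not cleared afterwards
        pending.clear();
    }
}
```

Any records left in a partly filled block are only sent by the explicit flush(), which is why the call is required after the last addData() or addTransaction().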

Results

  • public void setOutputFile(string Filename, string Encoding, int timeoutInMS)
    Post a filename to the service to write results to.
  • public string openDownloadStream(int timeoutInMS)
    Whilst records are still being added you can call getNextResult() to fetch any matching pairs one by one. Once you have finished adding data (and after waiting for the service to start producing results) you can call openDownloadStream(), and thereafter getNextResult() will stream the results rather than making multiple HTTP POSTs. openDownloadStream() returns the first record from the stream. If you have set a download block record limit, getNextResult() will return null when the end of the current block is reached; at that point, if more results are still available, call openDownloadStream() again to fetch the next block.
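The block-by-block download described above can be sketched as a loop. Several things here are assumptions rather than documented behaviour: the ResultSource interface is a hypothetical stand-in for the two ProxyEngine calls, getNextResult() is assumed to take no arguments and return a String, and openDownloadStream() is assumed to return null once no results remain. InMemoryResultSource only simulates the service so the loop can be run.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical loop that drains results one block at a time. */
final class BlockDownloader {

    public static void main(String[] args) {
        ResultSource source = new InMemoryResultSource(List.of("pair1", "pair2", "pair3"), 2);
        System.out.println(downloadAll(source, 30_000));
    }

    static List<String> downloadAll(ResultSource source, int timeoutInMS) {
        List<String> results = new ArrayList<>();
        // openDownloadStream() returns the first record of a block.
        String record = source.openDownloadStream(timeoutInMS);
        while (record != null) {
            results.add(record);
            record = source.getNextResult();
            if (record == null) {
                // End of the current block: try to open the next one.
                // Assumption: null here means there are no more results.
                record = source.openDownloadStream(timeoutInMS);
            }
        }
        return results;
    }
}

/** Hypothetical stand-in for the two ProxyEngine methods used above. */
interface ResultSource {
    String openDownloadStream(int timeoutInMS);
    String getNextResult();
}

/** In-memory simulation of a service that returns results in blocks. */
final class InMemoryResultSource implements ResultSource {
    private final List<String> all;
    private final int blockLimit; // 0 = everything in one block
    private int next = 0;
    private int blockEnd = 0;

    InMemoryResultSource(List<String> all, int blockLimit) {
        this.all = all;
        this.blockLimit = blockLimit;
    }

    @Override
    public String openDownloadStream(int timeoutInMS) {
        if (next >= all.size()) {
            return null;
        }
        blockEnd = blockLimit > 0 ? Math.min(all.size(), next + blockLimit) : all.size();
        return all.get(next++);
    }

    @Override
    public String getNextResult() {
        return next < blockEnd ? all.get(next++) : null;
    }
}
```

In real use you would replace InMemoryResultSource with calls on a ProxyEngine instance and check the Hub API documentation for the actual end-of-results signal.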

You can reuse the ProxyEngine class as-is in your own projects to avoid the need to deal directly with the REST service, or use it as sample code that demonstrates how to make HTTP calls.
