Using the API

An application that uses mHUB creates an engine. The engine provides a number of functions.

(Multiple engines can be created within a single process. Care must be taken, however, to ensure that the engines do not overload the system, in terms of CPU, RAM, and disk usage, and cause other running apps to become unresponsive.)

For general information on integrating mHUB refer to the Integration Guide.

For full details on the API, please refer to the relevant Reference Guide (C/C++/.NET/Java).

 

Initialization

Before any engine function can be used, Initialize() must be called to apply an activation code. The code is usually provided in the form of a license file, the contents of which must be read into a string and passed as an argument to the function; the filename itself cannot be passed. This mechanism provides flexibility in how the activation code is stored, for example:

  • it can be used as-is, stored in the original file that was provided;
  • it can be embedded within the user's application (which will need to be recompiled should the license expire and a new one be provided);
  • or it can be stored in a table within a database, for example in a cloud-based deployment that doesn't require access to a disk filing system.
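
As a minimal sketch of the first option, the helper below reads a license file into a string so that the contents, rather than the filename, can be handed to Initialize(). It is written in Python purely for brevity. The file name, the UTF-8 encoding, and the `engine` object in the closing comment are assumptions made for illustration and are not taken from the Reference Guide.

```python
from pathlib import Path


def read_activation_code(license_path):
    """Return the full contents of the license file as a string."""
    # The text is returned unmodified: whether surrounding whitespace
    # matters to the engine is not stated here, so nothing is trimmed.
    # The UTF-8 encoding is an assumption.
    return Path(license_path).read_text(encoding="utf-8")


# Hypothetical usage with an engine binding (all names are placeholders):
#   code = read_activation_code("mhub_license.txt")
#   engine.Initialize(code)   # pass the contents, never the path
```

The same string could equally come from an embedded constant or a database column, which is why the helper only deals with obtaining the text.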

 

Settings

The application then applies configuration settings. These are specified in an XML-formatted string that is passed into the ApplySettings() function. As with the activation code, the source of the string is flexible: the settings can be held in a file or in a table within a database, hardcoded within the application's source code, or generated dynamically at runtime.

ApplySettings() can be called multiple times, for example if configurations are split into more discrete blocks that are shared by multiple mHUB processes.

 

Loading Data

Once the engine has been configured and the processing mode applied (see the previous chapter, Processing Modes), data can be loaded into the engine. This is achieved by repeatedly calling AddData() and passing one record at a time. The data itself must be a delimited string that conforms to the input columns definition in the applied configuration.

Note that the first character of the data string must be the non-alphanumeric delimiter character. The delimiter can be determined dynamically at runtime; the same character does not have to be used for each row of data.
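
The leading-delimiter rule can be sketched as follows. This Python helper is illustrative only. It assumes that the same delimiter also separates the fields, that there is no trailing delimiter, and that no escaping mechanism exists, so a delimiter is chosen that does not occur in any field. The candidate list and the function names are hypothetical, so check the expected record layout against the Configuration Guide.

```python
# Hypothetical pool of non-alphanumeric delimiter candidates.
CANDIDATE_DELIMITERS = "|~^;,"


def pick_delimiter(fields, candidates=CANDIDATE_DELIMITERS):
    """Return the first candidate that appears in none of the fields."""
    for ch in candidates:
        if not any(ch in field for field in fields):
            return ch
    raise ValueError("no candidate delimiter is free in this record")


def format_record(fields, delimiter=None):
    """Build one record string whose first character is the delimiter."""
    if delimiter is None:
        # Choose per record, since each row may use a different delimiter.
        delimiter = pick_delimiter(fields)
    if len(delimiter) != 1 or delimiter.isalnum():
        raise ValueError("delimiter must be one non-alphanumeric character")
    if any(delimiter in field for field in fields):
        raise ValueError("delimiter occurs inside a field")
    # Assumed layout: <delim>field1<delim>field2...
    return delimiter + delimiter.join(fields)
```

For example, `format_record(["1", "Ann", "Smith"], "|")` yields `|1|Ann|Smith`, and a record containing a `|` in one of its fields falls back to the next free candidate.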

Once all data has been loaded, and if the engine is running in matching or normalization mode, NoMoreData() must be called.

Note that whilst the calling application is waiting for the processing to finish, any results must be retrieved from the output buffer (see below). Should the output buffer ever become full, any further processing of unprocessed data will be paused until space becomes available. If the results aren't retrieved by the calling application, then processing will appear to be deadlocked.

Matching

When overlapping data from two separate sources, the data can be loaded either sequentially (table 1 then table 2) or concurrently (e.g. table 1 from one thread and table 2 from a second thread).

All incoming data is stored in memory in 'clusters' of similar records. These clusters are determined by the match keys in use (refer to the Configuration Guide for further details). When incoming data is added to an existing cluster, the new data is compared to the existing data.

After calling NoMoreData(), the calling application must then wait until processing has finished. This will be a minimal period of time if little data was loaded, if only matching pairs are output, or if a small input buffer is in use; alternatively, final processing could take several minutes if a large volume of data was loaded and/or all matching pairs are to be grouped.

 

Updating Data

After loading data you can retrieve the records by their unique ref, and between loading data and calling NoMoreData() you can delete and modify records. If you need to modify or delete records after calling NoMoreData() you can call AllowMoreData().

When a record is deleted, all trace of the record is removed from the table, the clustered data, and any grouped matches. A record is updated by passing a new version of the record; the old version is then deleted and the new version added.

Owing to the multi-threaded nature of mHUB, when DeleteData() or UpdateData() is called with a unique ref that does not exist, the method does not immediately indicate that the record was not found. Instead, an error message of the form "Record <unique ref> not found" is added to the error stack. Refer to the relevant Reference Guide (C/C++/.NET/Java) for details of retrieving error messages from the error stack.
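
Because a missing record is only reported through the error stack, an application may want to scan the retrieved messages for that pattern. The Python sketch below assumes the messages have already been read from the error stack into a list of strings (how that is done is covered by the Reference Guide and not shown here) and that the text matches the form quoted above exactly. The function name is hypothetical.

```python
import re

# Matches the message form quoted in the text: Record <unique ref> not found
NOT_FOUND = re.compile(r"^Record (?P<ref>.+) not found$")


def missing_refs(error_messages):
    """Return the unique refs reported as not found, in message order."""
    refs = []
    for message in error_messages:
        match = NOT_FOUND.match(message.strip())
        if match:
            refs.append(match.group("ref"))
    return refs
```

Messages that do not follow the pattern are ignored, so they can still be handled separately by the caller.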

 

Retrieving Results

Depending on the types of results being output by the engine, these can be retrieved either on the fly whilst data is still being loaded (in the same thread or in a separate thread), or at the end of the process.

If only matching pairs are being output, or if outputting normalized data, then these are output on the fly. Otherwise, grouping of matching pairs is performed, which can only take place once all data has been loaded.

Use GetResultCount() to determine whether there are results available. If none are available, then either load more data (if retrieving results on the fly) or pause briefly before calling GetResultCount() again.

When results are available, call GetNextResult() in a loop until the given number of results have been read. Each result can be written to file, a table, or further processed.

Note that the value returned from GetResultCount() does not necessarily correspond to the number of results retrieved by GetNextResult(). GetResultCount() returns the total number of buffered results at that instant; when this number of results has been retrieved, further results could have been buffered in the meantime.
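
The count-then-read pattern can be sketched as below. The `StubEngine` class is a stand-in written for this example only. Its `get_result_count()` and `get_next_result()` methods merely mimic the roles of GetResultCount() and GetNextResult(), and the real signatures and return types should be taken from the Reference Guide.

```python
from collections import deque


class StubEngine:
    """Stand-in for a real engine binding; it only buffers results."""

    def __init__(self, results):
        self._buffer = deque(results)

    def get_result_count(self):
        # Number of results buffered at this instant.
        return len(self._buffer)

    def get_next_result(self):
        return self._buffer.popleft()


def drain_available(engine, handle):
    """Read exactly the number of results reported by one count call."""
    count = engine.get_result_count()
    for _ in range(count):
        handle(engine.get_next_result())
    # More results may have been buffered meanwhile, so callers
    # should call this again rather than treat the buffer as empty.
    return count
```

Calling a helper like this between loads, and repeatedly while waiting for processing to finish, keeps the output buffer from filling up.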

 

Finishing

If the engine is running in matching or normalization mode, and all input data has been loaded and processed and all results have been retrieved, then the engine will become idle. It can then be destroyed, or Reset() can be called so that new data can be loaded and processed using the current settings (different settings can also be applied before any new data is loaded).

 

Multiple Overlaps

You can run multiple overlaps against the same reference data - without having to reload the table 1 data. To do this, once the engine is in the Finished state, call ClearData(2) to clear the current table 2 data and prepare the table to receive fresh data. You can change certain settings between overlaps by calling ApplySettings() again before beginning to add new data (refer to the Configuration Guide, Appendix G for further details). If you do call ApplySettings() there is no need to call ClearData(2) as the engine will do this automatically.

If you need to apply updates to the table 1 data between overlaps, you can do this by first calling AllowMoreData(1).
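
The call order for a repeat overlap can be sketched as follows. `RecordingEngine` is a stand-in that only logs which call would be made. The method names mirror the functions named above, but their real signatures are not given here. In particular, how AddData() identifies the target table, and the assumption that NoMoreData() is called again after each fresh load, should be checked against the Reference Guide.

```python
class RecordingEngine:
    """Stand-in engine that records the calls made to it."""

    def __init__(self):
        self.calls = []

    def clear_data(self, table):
        self.calls.append(("ClearData", table))

    def apply_settings(self, settings_xml):
        self.calls.append(("ApplySettings", settings_xml))

    def add_data(self, record):
        self.calls.append(("AddData", record))

    def no_more_data(self):
        self.calls.append(("NoMoreData",))


def run_overlap(engine, table2_records, settings_xml=None):
    """Load a fresh table 2 against the table 1 data already held."""
    if settings_xml is not None:
        # Applying settings clears table 2 automatically (per the text).
        engine.apply_settings(settings_xml)
    else:
        engine.clear_data(2)
    for record in table2_records:
        engine.add_data(record)
    # Assumption: the end of the new load is signalled as for the first.
    engine.no_more_data()
```

Results for each overlap would then be retrieved as described under Retrieving Results before the next overlap begins.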