IdentityResolution

Identity resolution methods (also known as data matching or record linkage methods) identify records that describe the same real-world entity. This page introduces the pre-implemented methods and building blocks.

Rule-based identity resolution

First, we load the two data sets:

// loading data
HashedDataSet<Movie, Attribute> dataAcademyAwards = new HashedDataSet<>();
new MovieXMLReader().loadFromXML(new File("academy_awards.xml"), "/movies/movie", dataAcademyAwards);
HashedDataSet<Movie, Attribute> dataActors = new HashedDataSet<>();
new MovieXMLReader().loadFromXML(new File("actors.xml"), "/movies/movie", dataActors);

Then we define a matching rule that compares the records. We compare the movie titles with Jaccard similarity and the release dates with a custom date similarity function. The rule is a linear combination that weights the title similarity at 80% and the release date similarity at 20%, with a final similarity threshold of 70%.

// create a matching rule
LinearCombinationMatchingRule<Movie, Attribute> matchingRule = new LinearCombinationMatchingRule<>(0.7);
// add comparators
matchingRule.addComparator(
    (m1, m2, c) -> new TokenizingJaccardSimilarity().calculate(m1.getTitle(), m2.getTitle()), 0.8);
matchingRule.addComparator(
    (m1, m2, c) -> new YearSimilarity(10).calculate(m1.getDate(), m2.getDate()), 0.2);
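
To make the arithmetic behind such a rule concrete, here is a minimal, framework-independent sketch of a weighted linear combination. It does not use the library classes above: the class name LinearCombinationSketch, the linear year similarity, and the "greater than or equal" threshold check are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class LinearCombinationSketch {

    // Token-based Jaccard similarity: |A ∩ B| / |A ∪ B| over lower-cased whitespace tokens.
    static double jaccard(String a, String b) {
        Set<String> tokensA = new HashSet<>(Arrays.asList(a.toLowerCase().trim().split("\\s+")));
        Set<String> tokensB = new HashSet<>(Arrays.asList(b.toLowerCase().trim().split("\\s+")));
        Set<String> intersection = new HashSet<>(tokensA);
        intersection.retainAll(tokensB);
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        // split() always yields at least one token, so the union is never empty
        return (double) intersection.size() / union.size();
    }

    // Assumed year similarity: 1.0 for equal years, falling linearly to 0.0 at maxDifference.
    static double yearSimilarity(int year1, int year2, int maxDifference) {
        int difference = Math.abs(year1 - year2);
        return difference >= maxDifference ? 0.0 : 1.0 - (double) difference / maxDifference;
    }

    // Weighted sum of both similarities (weights 0.8 and 0.2, as in the rule above).
    static double combinedScore(String title1, int year1, String title2, int year2) {
        return 0.8 * jaccard(title1, title2) + 0.2 * yearSimilarity(year1, year2, 10);
    }

    // Assumption: a pair is accepted when its score is at least the threshold.
    static boolean isMatch(String title1, int year1, String title2, int year2, double threshold) {
        return combinedScore(title1, year1, title2, year2) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(combinedScore("Gone with the Wind", 1939, "Gone With The Wind", 1940));
        System.out.println(isMatch("The Godfather", 1972, "Godfather", 1972, 0.7));
    }
}
```

In this sketch, a pair with identical years but only half of the title tokens in common scores about 0.6 and stays below the 0.7 threshold, which shows how strongly the title weight dominates the decision.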

To speed up the whole process, we only want to compare records that seem similar instead of comparing all records. Hence, we add a blocking strategy that only compares movies from the same decade.

// create a blocker (blocking strategy)
Blocker<Movie, Attribute> blocker = new StandardBlocker<Movie, Attribute>(
(m) -> Integer.toString(m.getDate().getYear() / 10));
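
The effect of such a blocking key can be illustrated without the framework. The following sketch groups release years by decade and counts how many candidate pairs remain; the class DecadeBlockingSketch and its methods are hypothetical and only mirror the key function used above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DecadeBlockingSketch {

    // Same key as in the blocker above: integer division of the year by 10.
    static String blockingKey(int year) {
        return Integer.toString(year / 10);
    }

    // Groups the years of one data set by their blocking key.
    static Map<String, List<Integer>> buildBlocks(int[] years) {
        Map<String, List<Integer>> blocks = new HashMap<>();
        for (int year : years) {
            blocks.computeIfAbsent(blockingKey(year), k -> new ArrayList<>()).add(year);
        }
        return blocks;
    }

    // Counts the pairs that share a blocking key, i.e. the pairs that would be compared.
    static int countCandidatePairs(int[] left, int[] right) {
        Map<String, List<Integer>> leftBlocks = buildBlocks(left);
        Map<String, List<Integer>> rightBlocks = buildBlocks(right);
        int pairs = 0;
        for (Map.Entry<String, List<Integer>> block : leftBlocks.entrySet()) {
            List<Integer> partner = rightBlocks.get(block.getKey());
            if (partner != null) {
                pairs += block.getValue().size() * partner.size();
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        int[] left = {1939, 1941, 1972, 1999};
        int[] right = {1935, 1974, 1975, 2001};
        System.out.println("all pairs: " + left.length * right.length);
        System.out.println("pairs after blocking: " + countCandidatePairs(left, right));
    }
}
```

Note that a key like this trades recall for speed: movies from 1939 and 1940 land in different blocks and are never compared, even though their release years differ by only one.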

Finally, we initialise the matching engine, which does all the work for us, and run the identity resolution implementation with our matching rule.

// Initialize Matching Engine
MatchingEngine<Movie, Attribute> engine = new MatchingEngine<>();

// Execute the matching
Result<Correspondence<Movie, Attribute>> correspondences = engine.runIdentityResolution(dataAcademyAwards, dataActors, null, matchingRule, blocker);

To see how good our result is, we apply the built-in evaluation methods.

// load the gold standard (test set)
MatchingGoldStandard gsTest = new MatchingGoldStandard();
gsTest.loadFromCSVFile(new File("gs_academy_awards_2_actors_v2.csv"));

// evaluate the result
MatchingEvaluator<Movie, Attribute> evaluator = new MatchingEvaluator<Movie, Attribute>(true);
Performance perfTest = evaluator.evaluateMatching(correspondences.get(), gsTest);

// print the evaluation result
System.out.println("Academy Awards <-> Actors");
System.out.println(String.format("Precision: %.4f\nRecall: %.4f\nF1: %.4f",
    perfTest.getPrecision(), perfTest.getRecall(), perfTest.getF1()));
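
For readers unfamiliar with the reported measures, the sketch below shows how precision, recall and F1 can be computed from a set of predicted matches and a set of true matches. It is not the evaluator's implementation: the class MatchingMetricsSketch, the string pair identifiers, and the assumption that the gold standard lists all true matches are illustrative only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class MatchingMetricsSketch {

    // Helper: builds a set of pair identifiers such as "a1|b1" (hypothetical format).
    static Set<String> pairs(String... ids) {
        return new HashSet<>(Arrays.asList(ids));
    }

    // Number of predicted pairs that are also in the gold standard.
    static int truePositives(Set<String> predicted, Set<String> gold) {
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(gold);
        return correct.size();
    }

    // Precision: share of predicted matches that are correct.
    static double precision(Set<String> predicted, Set<String> gold) {
        return predicted.isEmpty() ? 0.0 : (double) truePositives(predicted, gold) / predicted.size();
    }

    // Recall: share of true matches that were found.
    static double recall(Set<String> predicted, Set<String> gold) {
        return gold.isEmpty() ? 0.0 : (double) truePositives(predicted, gold) / gold.size();
    }

    // F1: harmonic mean of precision and recall.
    static double f1(Set<String> predicted, Set<String> gold) {
        double p = precision(predicted, gold);
        double r = recall(predicted, gold);
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        Set<String> predicted = pairs("a1|b1", "a2|b2", "a3|b3", "a4|b9");
        Set<String> gold = pairs("a1|b1", "a2|b2", "a5|b5");
        System.out.println(String.format("Precision: %.4f%nRecall: %.4f%nF1: %.4f",
            precision(predicted, gold), recall(predicted, gold), f1(predicted, gold)));
    }
}
```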

Learning Matching Rules

Instead of defining the matching rule ourselves, we can use machine learning to train a classifier that decides whether two records match.

After loading our data as in the rule-based identity resolution, we also load a training set for our classifier.

// load the gold standard (training set)
MatchingGoldStandard gsTraining = new MatchingGoldStandard();
gsTraining.loadFromCSVFile(new File("usecase/movie/goldstandard/gs_academy_awards_2_actors.csv"));

In the next step, the classifier needs to be chosen. Winter uses the machine learning algorithms provided by WEKA, so there are several ways to initialize a classifier for the matching rule. Please check the documentation of the respective WEKA classifier to understand the classifier and the options you can choose from.

Plain vanilla

// create a matching rule & provide classifier, options
String tree = "J48"; // name of the WEKA classifier (C4.5 decision tree)
String options[] = new String[1];
options[0] = "-U"; // J48 option: use an unpruned tree

WekaMatchingRule<Movie, Attribute> matchingRule = new WekaMatchingRule<>(0.8, tree, options);

A meta-classifier and a base-classifier

// create a matching rule + provide classifier, options
String adaBoost = "AdaBoostM1";
String metaOptions[] = new String[2];
metaOptions[0] = "-P";
metaOptions[1] = "90";
String tree = "J48";
String baseOptions[] = new String[1];
baseOptions[0] = "-U";

WekaMatchingRule<Movie, Attribute> matchingRule = new WekaMatchingRule<>(0.8, adaBoost, tree, metaOptions, baseOptions);

One meta-classifier and multiple base-classifiers

// create a matching rule + provide classifier, options
String vote = "Vote";
String metaOptions[] = new String[1];
metaOptions[0] = "";
String baseClassifiers[] = new String[2];
baseClassifiers[0] = "J48";
baseClassifiers[1] = "SimpleLogistic";
String baseOptions[][] = new String[2][1];
baseOptions[0][0] = "-U";
baseOptions[1][0] = "";

WekaMatchingRule<Movie, Attribute> matchingRule = new WekaMatchingRule<>(0.8, vote, baseClassifiers, metaOptions, baseOptions);

Additionally, forward or backward selection can be applied to find a better feature subset. Both selections are performed inside a 10-fold cross-validation to provide robust results.

Forward selection

matchingRule.setForwardSelection(true);

Backward selection

matchingRule.setBackwardSelection(true);
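
As background, greedy forward selection can be sketched as follows: start with no features and repeatedly add the single feature that improves a quality score the most, stopping when no addition helps. This is a generic illustration, not the library's implementation. The class ForwardSelectionSketch and the toyScore function are hypothetical; in practice the score would be a cross-validated quality measure of the classifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

class ForwardSelectionSketch {

    // Greedy forward selection over feature indices 0..featureCount-1.
    static List<Integer> forwardSelect(int featureCount, ToDoubleFunction<List<Integer>> score) {
        List<Integer> selected = new ArrayList<>();
        double bestScore = Double.NEGATIVE_INFINITY;
        boolean improved = true;
        while (improved && selected.size() < featureCount) {
            improved = false;
            int bestFeature = -1;
            for (int feature = 0; feature < featureCount; feature++) {
                if (selected.contains(feature)) {
                    continue; // already part of the subset
                }
                List<Integer> candidate = new ArrayList<>(selected);
                candidate.add(feature);
                double candidateScore = score.applyAsDouble(candidate);
                if (candidateScore > bestScore) {
                    bestScore = candidateScore;
                    bestFeature = feature;
                    improved = true;
                }
            }
            if (improved) {
                selected.add(bestFeature); // keep the best addition of this round
            }
        }
        return selected;
    }

    // Hypothetical stand-in for a cross-validated score: per-feature gains minus a size penalty.
    static double toyScore(List<Integer> subset) {
        double[] gains = {0.5, 0.02, 0.3};
        double total = 0.0;
        for (int feature : subset) {
            total += gains[feature];
        }
        return total - 0.05 * subset.size();
    }

    public static void main(String[] args) {
        System.out.println(forwardSelect(3, ForwardSelectionSketch::toyScore));
    }
}
```

Backward selection works the other way around: it starts with all features and removes one at a time as long as the score does not get worse.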

Then the comparators are selected, which provide the similarity values for the feature vector of each record pair.

// add comparators
matchingRule.addComparator(new MovieTitleComparatorEqual());
matchingRule.addComparator(new MovieDateComparator2Years());
matchingRule.addComparator(new MovieDateComparator10Years());
matchingRule.addComparator(new MovieDirectorComparatorJaccard());
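
Conceptually, each comparator contributes one similarity value, and the values for a record pair together form the feature vector that the classifier sees. The following sketch illustrates this idea with plain Java functions. The Movie type, the comparator functions, and the class FeatureVectorSketch are simplified, hypothetical stand-ins and not the library's comparator classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

class FeatureVectorSketch {

    // Hypothetical minimal record type with only two attributes.
    static final class Movie {
        final String title;
        final int year;

        Movie(String title, int year) {
            this.title = title;
            this.year = year;
        }
    }

    // Assumed behaviour: 1.0 if the titles are equal (ignoring case), otherwise 0.0.
    static double titleEqual(Movie a, Movie b) {
        return a.title.equalsIgnoreCase(b.title) ? 1.0 : 0.0;
    }

    // Assumed behaviour: linear decay from 1.0 to 0.0 within maxDifference years.
    static double yearWithin(Movie a, Movie b, int maxDifference) {
        int difference = Math.abs(a.year - b.year);
        return difference >= maxDifference ? 0.0 : 1.0 - (double) difference / maxDifference;
    }

    // One function per feature, in a fixed order.
    static List<ToDoubleBiFunction<Movie, Movie>> exampleComparators() {
        List<ToDoubleBiFunction<Movie, Movie>> comparators = new ArrayList<>();
        comparators.add(FeatureVectorSketch::titleEqual);
        comparators.add((a, b) -> yearWithin(a, b, 2));
        comparators.add((a, b) -> yearWithin(a, b, 10));
        return comparators;
    }

    // Applies every comparator to the pair and collects the similarity values.
    static double[] featureVector(Movie a, Movie b, List<ToDoubleBiFunction<Movie, Movie>> comparators) {
        double[] features = new double[comparators.size()];
        for (int i = 0; i < features.length; i++) {
            features[i] = comparators.get(i).applyAsDouble(a, b);
        }
        return features;
    }

    public static void main(String[] args) {
        Movie left = new Movie("Casablanca", 1942);
        Movie right = new Movie("casablanca", 1943);
        System.out.println(Arrays.toString(featureVector(left, right, exampleComparators())));
    }
}
```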

Besides these dedicated Movie comparators, the framework also includes more general record comparators, which offer more flexibility.

After creating a blocker as in the rule-based identity resolution, the matching rule needs to be trained.

// learning Matching rule
RuleLearner<Movie, Attribute> learner = new RuleLearner<>();
learner.learnMatchingRule(dataAcademyAwards, dataActors, null, matchingRule, gsTraining);

The newly trained matching rule can be stored for later reuse.

// Store Matching Rule
matchingRule.storeModel(new File("usecase/movie/output/model"));

To reuse the model, read it from the file system instead of initializing and training a new matching rule. Please note that it is possible to load PMML-based models as well as WEKA models.

// Load Matching Rule
matchingRule.readModel(new File("usecase/movie/output/model"));

The blocking, matching engine initialization, and evaluation steps are the same as the ones performed for the rule-based approach.
