forked from olehmberg/winter
IdentityResolution
Alex0815 edited this page Jun 21, 2017
Identity resolution methods (also known as data matching or record linkage methods) identify records that describe the same real-world entity. This page introduces the different pre-implemented methods and building blocks.
- First we load the two data sets
// loading data
HashedDataSet<Movie, Attribute> dataAcademyAwards = new HashedDataSet<>();
new MovieXMLReader().loadFromXML(new File("academy_awards.xml"), "/movies/movie", dataAcademyAwards);
HashedDataSet<Movie, Attribute> dataActors = new HashedDataSet<>();
new MovieXMLReader().loadFromXML(new File("actors.xml"), "/movies/movie", dataActors);
- Then we define a matching rule that compares the records. We compare the movie titles with Jaccard similarity and the release dates with a custom date similarity function. The rule combines both values linearly, with a weight of 80% for the title similarity and 20% for the release date similarity, and applies a final similarity threshold of 70%.
// create a matching rule
LinearCombinationMatchingRule<Movie, Attribute> matchingRule = new LinearCombinationMatchingRule<>(0.7);
// add comparators
matchingRule.addComparator(
(m1, m2, c) -> new TokenizingJaccardSimilarity().calculate(m1.getTitle(), m2.getTitle()), 0.8);
matchingRule.addComparator(
(m1, m2, c) -> new YearSimilarity(10).calculate(m1.getDate(), m2.getDate()), 0.2);
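To make the arithmetic behind such a rule concrete, here is a minimal, library-free sketch of a weighted linear combination. It does not use the WInte.r classes: `tokenJaccard`, `yearSimilarity`, `score`, and `isMatch` are hypothetical helpers, the linear decay used for the year similarity is an assumption (the actual `YearSimilarity` may compute its value differently), and so is treating a score exactly at the threshold as a match.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class LinearCombinationSketch {

    // Jaccard similarity over lower-cased, whitespace-separated tokens:
    // |intersection| / |union| of the two token sets.
    static double tokenJaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().trim().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().trim().split("\\s+")));
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // ASSUMPTION: similarity falls linearly from 1 (same year) to 0 (maxDiff years apart).
    static double yearSimilarity(int y1, int y2, int maxDiff) {
        int diff = Math.abs(y1 - y2);
        return diff >= maxDiff ? 0.0 : 1.0 - (double) diff / maxDiff;
    }

    // Weighted sum: 80% title similarity, 20% year similarity.
    static double score(String title1, int year1, String title2, int year2) {
        return 0.8 * tokenJaccard(title1, title2) + 0.2 * yearSimilarity(year1, year2, 10);
    }

    // ASSUMPTION: a pair counts as a match when the score reaches the 0.7 threshold.
    static boolean isMatch(String title1, int year1, String title2, int year2) {
        return score(title1, year1, title2, year2) >= 0.7;
    }

    public static void main(String[] args) {
        System.out.println(score("The Godfather", 1972, "The Godfather", 1973));
        System.out.println(isMatch("The Godfather", 1972, "The Godfather Part II", 1974));
    }
}
```

Because the title carries 80% of the weight, two records whose titles share only half of their tokens cannot reach the threshold, even when the release years are close.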
- To speed up the whole process, we only want to compare records that seem similar instead of comparing all records. Hence, we add a blocking strategy that only compares movies from the same decade.
// create a blocker (blocking strategy)
Blocker<Movie, Attribute> blocker = new StandardBlocker<Movie, Attribute>(
(m) -> Integer.toString(m.getDate().getYear() / 10));
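The effect of such a blocking key can be illustrated without the framework. The sketch below is only an illustration of the general technique, not WInte.r's implementation: the `Film` class and the helper methods are hypothetical, and the key mirrors the `year / 10` expression used in the blocker above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DecadeBlockingSketch {

    // Hypothetical minimal record: a title and a release year.
    static final class Film {
        final String title;
        final int year;

        Film(String title, int year) {
            this.title = title;
            this.year = year;
        }
    }

    // Blocking key: the decade, e.g. 1972 -> "197".
    static String blockingKey(Film f) {
        return Integer.toString(f.year / 10);
    }

    // Group the records of one data set by their blocking key.
    static Map<String, List<Film>> buildBlocks(List<Film> films) {
        Map<String, List<Film>> blocks = new HashMap<>();
        for (Film f : films) {
            blocks.computeIfAbsent(blockingKey(f), k -> new ArrayList<>()).add(f);
        }
        return blocks;
    }

    // Candidate pairs: only records from the two data sets that share a key.
    static List<Film[]> candidatePairs(List<Film> left, List<Film> right) {
        Map<String, List<Film>> rightBlocks = buildBlocks(right);
        List<Film[]> pairs = new ArrayList<>();
        for (Film l : left) {
            for (Film r : rightBlocks.getOrDefault(blockingKey(l), new ArrayList<>())) {
                pairs.add(new Film[] { l, r });
            }
        }
        return pairs;
    }
}
```

Note the trade-off: a key like this never pairs a movie dated 1979 with one dated 1980, so matches whose dates straddle a decade boundary are lost before the matching rule ever sees them.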
- Finally, we initialise the matching engine, which does all the work for us, and run the identity resolution with our matching rule and blocker.
// Initialize Matching Engine
MatchingEngine<Movie, Attribute> engine = new MatchingEngine<>();
// Execute the matching
Result<Correspondence<Movie, Attribute>> correspondences = engine.runIdentityResolution(dataAcademyAwards, dataActors, null, matchingRule, blocker);
- To see how good our result is, we apply the built-in evaluation methods.
// load the gold standard (test set)
MatchingGoldStandard gsTest = new MatchingGoldStandard();
gsTest.loadFromCSVFile(new File("gs_academy_awards_2_actors_v2.csv"));
// evaluate the result
MatchingEvaluator<Movie, Attribute> evaluator = new MatchingEvaluator<Movie, Attribute>(true);
Performance perfTest = evaluator.evaluateMatching(correspondences.get(), gsTest);
// print the evaluation result
System.out.println("Academy Awards <-> Actors");
System.out.println(String.format("Precision: %.4f\nRecall: %.4f\nF1: %.4f",
perfTest.getPrecision(), perfTest.getRecall(), perfTest.getF1()));
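For reference, the three reported measures can be computed from sets of record pairs as sketched below. This is a simplified illustration under two assumptions: pairs are encoded as hypothetical `"id1|id2"` strings, and the gold standard lists every true match. How `MatchingEvaluator` treats negative examples or pairs missing from the gold standard may differ.

```java
import java.util.HashSet;
import java.util.Set;

class MatchingEvaluationSketch {

    // Number of predicted pairs that also appear in the gold standard.
    static int truePositives(Set<String> predicted, Set<String> gold) {
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(gold);
        return correct.size();
    }

    // Precision: share of predicted matches that are correct.
    static double precision(Set<String> predicted, Set<String> gold) {
        return predicted.isEmpty() ? 0.0
                : (double) truePositives(predicted, gold) / predicted.size();
    }

    // Recall: share of true matches that were found.
    static double recall(Set<String> predicted, Set<String> gold) {
        return gold.isEmpty() ? 0.0
                : (double) truePositives(predicted, gold) / gold.size();
    }

    // F1: harmonic mean of precision and recall.
    static double f1(Set<String> predicted, Set<String> gold) {
        double p = precision(predicted, gold);
        double r = recall(predicted, gold);
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }
}
```

With three predicted pairs of which two are correct, and four true matches in total, this gives a precision of 2/3, a recall of 1/2, and an F1 of 4/7.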
Instead of defining the matching rule ourselves, we can use machine learning to train a classifier that decides whether two records match.
- After loading the data as for the rule-based identity resolution, we also load a training set for the classifier.
// load the gold standard (training set)
MatchingGoldStandard gsTraining = new MatchingGoldStandard();
gsTraining.loadFromCSVFile(new File("usecase/movie/goldstandard/gs_academy_awards_2_actors.csv"));
- In the next step, the classifier needs to be chosen. WInte.r uses the machine learning algorithms provided by WEKA, so there are several ways to initialize a classifier for the matching rule. Please check the WEKA documentation of the respective classifier to understand how it works and which options you can choose from.
Plain vanilla
// create a matching rule & provide classifier, options
String tree = "J48"; // new instance of tree
String options[] = new String[1];
options[0] = "-U";
WekaMatchingRule<Movie, Attribute> matchingRule = new WekaMatchingRule<>(0.8, tree, options);
A meta-classifier and a base-classifier
// create a matching rule + provide classifier, options
String adaBoost = "AdaBoostM1";
String metaOptions[] = new String[2];
metaOptions[0] = "-P";
metaOptions[1] = "90";
String tree = "J48";
String baseOptions[] = new String[1];
baseOptions[0] = "-U";
WekaMatchingRule<Movie, Attribute> matchingRule = new WekaMatchingRule<>(0.8, adaBoost, tree, metaOptions, baseOptions);
One meta-classifier and multiple base-classifiers
// create a matching rule + provide classifier, options
String vote = "Vote";
String metaOptions[] = new String[1];
metaOptions[0] = "";
String baseClassifiers[] = new String[2];
baseClassifiers[0] = "J48";
baseClassifiers[1] = "SimpleLogistic";
String baseOptions[][] = new String[2][1];
baseOptions[0][0] = "-U";
baseOptions[1][0] = "";
WekaMatchingRule<Movie, Attribute> matchingRule = new WekaMatchingRule<>(0.8, vote, baseClassifiers, metaOptions, baseOptions);
- Additionally, forward or backward selection can be applied to choose a better subset of features. Both selections are performed inside a 10-fold cross-validation to make the results more reliable.
Forward selection
matchingRule.setForwardSelection(true);
Backward selection
matchingRule.setBackwardSelection(true);
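As background, greedy forward selection in general works as sketched below: start with no features and repeatedly add the single feature that improves an evaluation score the most, stopping when nothing improves it. This is a generic illustration, not WInte.r's or WEKA's code. The `evaluate` function is a hypothetical stand-in for the cross-validated performance of a classifier trained on the given feature subset, and the actual search strategy used by the library may differ.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleFunction;

class ForwardSelectionSketch {

    // Greedy forward selection over feature names.
    // 'evaluate' is an ASSUMED scoring function (higher is better),
    // e.g. the cross-validated F1 for a feature subset.
    static Set<String> forwardSelect(List<String> features,
                                     ToDoubleFunction<Set<String>> evaluate) {
        Set<String> selected = new LinkedHashSet<>();
        double bestScore = Double.NEGATIVE_INFINITY;
        boolean improved = true;
        while (improved) {
            improved = false;
            String bestFeature = null;
            // Try adding each unused feature and remember the best strict improvement.
            for (String f : features) {
                if (selected.contains(f)) {
                    continue;
                }
                Set<String> candidate = new LinkedHashSet<>(selected);
                candidate.add(f);
                double s = evaluate.applyAsDouble(candidate);
                if (s > bestScore) {
                    bestScore = s;
                    bestFeature = f;
                    improved = true;
                }
            }
            if (bestFeature != null) {
                selected.add(bestFeature);
            }
        }
        return selected;
    }
}
```

Backward selection is the mirror image: it starts with all features and repeatedly removes the one whose removal improves the score the most.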
- Then the comparators are selected. They provide the similarity values that make up the feature vector for each compared pair of records.
// add comparators
matchingRule.addComparator(new MovieTitleComparatorEqual());
matchingRule.addComparator(new MovieDateComparator2Years());
matchingRule.addComparator(new MovieDateComparator10Years());
matchingRule.addComparator(new MovieDirectorComparatorJaccard());
Besides these dedicated Movie comparators, WInte.r also includes more general Record Comparators, which can be used for more flexibility.
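Conceptually, each comparator contributes one number to the feature vector on which the classifier is trained. The following library-free sketch shows that idea only. The `Film` class and the three example comparators are hypothetical, and their formulas are assumptions that merely echo the names of the comparators above, not their actual implementations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

class FeatureVectorSketch {

    // Hypothetical minimal record.
    static final class Film {
        final String title;
        final int year;

        Film(String title, int year) {
            this.title = title;
            this.year = year;
        }
    }

    // ASSUMPTION: linear decay from 1 (same year) to 0 (maxDiff or more years apart).
    static double withinYears(Film a, Film b, int maxDiff) {
        int diff = Math.abs(a.year - b.year);
        return diff >= maxDiff ? 0.0 : 1.0 - (double) diff / maxDiff;
    }

    // Three illustrative comparators: title equality, 2-year and 10-year date windows.
    static List<ToDoubleBiFunction<Film, Film>> exampleComparators() {
        List<ToDoubleBiFunction<Film, Film>> comparators = new ArrayList<>();
        comparators.add((a, b) -> a.title.equalsIgnoreCase(b.title) ? 1.0 : 0.0);
        comparators.add((a, b) -> withinYears(a, b, 2));
        comparators.add((a, b) -> withinYears(a, b, 10));
        return comparators;
    }

    // One similarity value per comparator, in the order the comparators were added.
    static double[] featureVector(Film a, Film b,
                                  List<ToDoubleBiFunction<Film, Film>> comparators) {
        double[] vector = new double[comparators.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = comparators.get(i).applyAsDouble(a, b);
        }
        return vector;
    }
}
```

A classifier then learns from labelled vectors of this kind which combinations of similarity values indicate a match, instead of relying on hand-picked weights and a fixed threshold.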
- After creating a blocker as for the rule-based identity resolution, the matching rule needs to be trained.
// learning Matching rule
RuleLearner<Movie, Attribute> learner = new RuleLearner<>();
learner.learnMatchingRule(dataAcademyAwards, dataActors, null, matchingRule, gsTraining);
- The newly trained matching rule can be stored for later reuse.
// Store Matching Rule
matchingRule.storeModel(new File("usecase/movie/output/model"));
- To reuse the model, read it from the file system instead of initializing and training a new matching rule. Please note that it is possible to load PMML-based models as well as WEKA models.
// Load Matching Rule
matchingRule.readModel(new File("usecase/movie/output/model"));
The remaining steps (blocking, initializing the matching engine, and evaluation) are the same as in the rule-based approach.