RecordLinkage: Record Linkage in R

> library(RecordLinkage)
Loading required package: DBI

Attaching package: 'DBI'
The following object is masked from 'package:git2r':

    fetch
Loading required package: RSQLite
Loading required package: ff
Loading required package: bit
Attaching package bit
package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
creators: bit bitwhich
coercion: as.logical as.integer as.bit as.bitwhich which
operator: ! & | xor != ==
querying: print length any all min max range sum summary
bit access: length<- [ [<- [[ [[<-
for more help type ?bit

Attaching package: 'bit'
The following object is masked from 'package:git2r':

    clone
The following object is masked from 'package:base':

    xor
Attaching package ff
- getOption("fftempdir")=="/var/folders/8f/s_lbgwks6q7g3lz52q93ngph0000gn/T//RtmpOF72Zx"
- getOption("ffextension")=="ff"
- getOption("ffdrop")==TRUE
- getOption("fffinonexit")==TRUE
- getOption("ffpagesize")==65536
- getOption("ffcaching")=="mmnoflush"  -- consider "ffeachflush" if your system stalls on large writes
- getOption("ffbatchbytes")==16777216 -- consider a different value for tuning your system
- getOption("ffmaxbytes")==536870912 -- consider a different value for tuning your system

Attaching package: 'ff'
The following objects are masked from 'package:bit':

    clone, clone.default, clone.list
The following objects are masked from 'package:utils':

    write.csv, write.csv2
The following objects are masked from 'package:git2r':

    add, clone
The following object is masked from 'package:magrittr':

    add
The following objects are masked from 'package:base':

    is.factor, is.ordered
Loading required package: ffbase

Attaching package: 'ffbase'
The following objects are masked from 'package:ff':

    [.ff, [.ffdf, [<-.ff, [<-.ffdf
The following objects are masked from 'package:base':

    %in%, table
RecordLinkage library
[c] IMBEI Mainz

Attaching package: 'RecordLinkage'
The following object is masked from 'package:ff':

    clone
The following object is masked from 'package:bit':

    clone
The following object is masked from 'package:git2r':

    clone

Version: 0.4.8


Function Description
%append%-methods Concatenate comparison patterns or classification results
RLBigData-class Class "RLBigData"
RLBigDataDedup Constructors for big data objects.
RLBigDataDedup-class Class "RLBigDataDedup"
RLBigDataLinkage-class Class "RLBigDataLinkage"
RLResult-class Class "RLResult"
RLdata500 Test data for Record Linkage
RecLinkClassif-class Class "RecLinkClassif"
RecLinkData-class Class "RecLinkData"
RecLinkData.object Record Linkage Data Object
RecLinkResult Record Linkage Result Object
RecLinkResult-class Class "RecLinkResult"
[.RecLinkData Subset operator for record linkage objects
classifySupv Supervised Classification
classifyUnsup Unsupervised Classification
clone Serialization of record linkage object.
compare.dedup Compare Records
deleteNULLs Remove NULL Values
editMatch Edit Matching Status
emClassify Weight-based Classification of Data Pairs
emWeights Calculate weights
epiClassify Classify record pairs with EpiLink weights
epiWeights Calculate EpiLink weights
ff_vector-class Class '"ff_vector"'
ffdf-class Class '"ffdf"'
fsClassify Stochastic record linkage.
genSamples Generate Training Set
getErrorMeasures-methods Calculate Error Measures
getExpectedSize Estimate number of record pairs.
getFrequencies-methods Get attribute frequencies
getMinimalTrain Create a minimal training set
getPairs Extract Record Pairs
getParetoThreshold Estimate Threshold from Pareto Distribution
getTable-methods Build contingency table
gpdEst Estimate Threshold from Pareto Distribution
isFALSE Check for FALSE
mygllm Generalized Log-Linear Fitting
optimalThreshold Optimal Threshold for Record Linkage
phonetics Phonetic Code
resample Safe Sampling
show Show a RLBigData object
splitData Split Data
strcmp String Metrics
summary.RLBigData summary methods for '"RLBigData"' objects.
summary.RLResult Summary method for '"RLResult"' objects.
summary.RecLinkData Print Summary of Record Linkage Data
trainSupv Train a Classifier
unorderedPairs Create Unordered Pairs

compare.dedup

Arguments

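  • dataset: table of records to deduplicate (data frame or matrix)
  • dataset1, dataset2: the two data sets to link (used by compare.linkage)
  • phonetic: whether, or on which columns, to apply a phonetic code
  • strcmp: whether, or on which columns, to apply a string metric
  • strcmpfun: the string metric function to use
  • identity, identity1, identity2: optional vectors of true record identities, used to mark pairs as matches or non-matches
  • n_match, n_non_match: number of matches and non-matches wanted in the result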
> data(RLdata500)
> rpairs <- compare.dedup(RLdata500, identity=identity.RLdata500, strcmp=TRUE, blockfld=list(1,c(5,6,7)))
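
A minimal sketch of a deduplication run on the bundled test data, assuming the RecordLinkage package is installed. data(RLdata500) is called first because it provides both RLdata500 and identity.RLdata500; without it, compare.dedup fails with "object 'RLdata500' not found". The variable name rpairs is only illustrative.

```r
library(RecordLinkage)

# Load the test data; this creates RLdata500 and identity.RLdata500.
data(RLdata500)

# Build comparison patterns. blockfld = list(1, c(5, 6, 7)) keeps a pair
# if the records agree on column 1 OR on all of columns 5, 6 and 7.
rpairs <- compare.dedup(RLdata500,
                        identity = identity.RLdata500,
                        strcmp   = TRUE,
                        blockfld = list(1, c(5, 6, 7)))

# Inspect the result: one row per candidate pair, with the record ids,
# one similarity column per attribute, and the known matching status.
summary(rpairs)
head(rpairs$pairs)
```

Blocking reduces the number of candidate pairs, so only records that share at least one blocking key are compared.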

RLdata500

> data("RLdata500")
> RLdata500 %>% str()
'data.frame':    500 obs. of  7 variables:
 $ fname_c1: Factor w/ 146 levels "ALEXANDER","ANDRE",..: 19 42 114 128 112 77 42 139 26 99 ...
 $ fname_c2: Factor w/ 23 levels "ALEXANDER","ANDREAS",..: NA NA NA NA NA NA NA NA NA NA ...
 $ lname_c1: Factor w/ 108 levels "ALBRECHT","BAUER",..: 61 2 31 106 50 23 76 61 77 30 ...
 $ lname_c2: Factor w/ 8 levels "ENGEL","FISCHER",..: NA NA NA NA NA NA NA NA NA NA ...
 $ by      : int  1949 1968 1930 1957 1966 1929 1967 1942 1978 1971 ...
 $ bm      : int  7 7 4 9 1 7 8 9 3 2 ...
 $ bd      : int  22 27 30 2 13 4 1 20 4 27 ...

strcmp / jarowinkler / levenshteinSim / levenshteinDist

Compute the similarity between strings.

Arguments

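  • str1, str2: the strings to compare (character vectors, compared element by element)
  • W_1, W_2, W_3: adjustment weights for jarowinkler
  • r: maximum transposition radius for jarowinkler, as a fraction of the length of the shorter string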
> jarowinkler("Apple", "Apple")
[1] 1
> jarowinkler("Apple", "Andreas")
[1] 0.6057143
> jarowinkler(c("Andreas", "Borg"), c("Andreas", "Bork"))
[1] 1.0000000 0.8833333
> levenshteinSim("Andreas", c("Anreas", "Andeas"))
[1] 0.8571429 0.8571429
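
To see where the levenshteinSim values come from, here is a sketch in base R that needs no extra package. It assumes the similarity is defined as 1 - d / max(length of str1, length of str2), where d is the Levenshtein distance. The helper name lev_sim is hypothetical and not part of RecordLinkage.

```r
# Hypothetical helper: Levenshtein similarity built on utils::adist().
lev_sim <- function(str1, str2) {
  # Edit distance for each pair of elements (shorter vector is recycled).
  d <- mapply(function(a, b) utils::adist(a, b)[1, 1],
              str1, str2, USE.NAMES = FALSE)
  # Normalise by the length of the longer string in each pair.
  1 - d / pmax(nchar(str1), nchar(str2))
}

sims <- lev_sim("Andreas", c("Anreas", "Andeas"))
print(sims)
```

Each candidate differs from "Andreas" by one deleted character, so d = 1 and the similarity is 1 - 1/7, which agrees with the 0.8571429 printed above.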

trainSupv

> trainSupv(rpairs, method, use.pred = FALSE, omit.possible = TRUE, 
+   convert.na = TRUE, include.data = FALSE, ...)
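
The signature above can be exercised roughly as follows. This is a sketch, assuming the RecordLinkage and rpart packages are installed. The blocking fields, the 50/50 split, the seed, and the variable names parts, model, and result are illustrative choices, not recommendations.

```r
library(RecordLinkage)

data(RLdata500)
set.seed(1)  # splitData samples at random; fix the seed for repeatability

# Comparison patterns with known matching status (needed for training).
rpairs <- compare.dedup(RLdata500,
                        identity = identity.RLdata500,
                        blockfld = list(1, 3, 5, 6, 7))

# Split into training and validation sets, keeping the match proportion.
parts <- splitData(rpairs, prop = 0.5, keep.mprop = TRUE)

# Train a decision tree; minsplit is passed through "..." to rpart.
model <- trainSupv(parts$train, method = "rpart", minsplit = 5)

# Classify the held-out pairs and summarise the outcome.
result <- classifySupv(model = model, newdata = parts$valid)
summary(result)
```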