This paper addresses the name matching (duplicate detection) problem in the US patent dataset, which contains more than 400K unique spellings of company names. To solve the matching problem, we choose an appropriate string similarity measure and clustering approach and estimate their parameters. Finally, we apply them to the whole dataset and estimate the positive and negative rates.
Monday, July 25, 2011
A new paper from HP discusses the problem of distinguishing corporate names and presents an automated solution: "Company Names Matching in the Large Patents Dataset" by Timofey Medvedev and Alexander Ulanov, HP Laboratories, HPL-2011-90R1.
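To make the idea in the abstract concrete, here is a minimal Python sketch of the general technique it describes: score pairs of names with a string similarity measure, then group names whose score clears a threshold. Everything specific here is an assumption for illustration and not taken from the paper: the similarity measure (the standard library's `difflib.SequenceMatcher` ratio), the clustering rule (single-linkage grouping via union-find), the 0.85 threshold, the `normalize` helper, and the sample names.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase and replace punctuation with spaces (hypothetical helper)."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for whatever measure the paper selects."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_names(names, threshold=0.85):
    """Group names linked by any pair with similarity >= threshold.

    The threshold is an assumed value, not one estimated in the paper.
    """
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair: fine for a toy list, too slow for 400K names.
    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())


# Illustrative spellings only; not drawn from the patent dataset.
names = [
    "Hewlett-Packard Company",
    "Hewlett Packard Company",
    "Hewlet-Packard Company",
    "International Business Machines Corp.",
    "International Business Machines Corp",
]
clusters = cluster_names(names)
for group in clusters:
    print(group)
```

The all-pairs loop is quadratic in the number of names, so a dataset of the size mentioned in the abstract would need some way of limiting which pairs get compared. Likewise, the threshold is the kind of parameter the abstract says has to be estimated rather than guessed.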