12/4/2023

Duplicate detector application

Abstract

Near-duplicate documents are particularly common in news media corpora. Editors often update wirefeed articles to address space constraints in print editions or to add local context; journalists often lightly modify previous articles with new information or minor corrections. Near-duplicate documents have potentially significant costs, including bloating corpora with redundant information (biasing techniques built upon such corpora) and requiring additional human and computational analytic resources for marginal benefit. Filtering near-duplicates out of a collection is thus important, and is particularly challenging in applications that require them to be filtered out in real time with high precision.

Previous near-duplicate detection methods typically work offline to identify all near-duplicate pairs in a set of documents. We propose an online system which flags a near-duplicate document by finding its most likely original. This system adapts the shingling algorithm proposed by Broder (1997), and we test it on a challenging dataset of web-based news articles. Our online system achieves state-of-the-art F1 scores, and can be tuned to trade precision for recall and vice versa.

Given its performance and online nature, our method can be used in many real-world applications. We present one such application: filtering near-duplicates to improve the productivity of human analysts in a situational awareness tool.
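To make the idea of shingling and online flagging concrete, here is a minimal Python sketch. It is not the authors' implementation: the names `shingles`, `resemblance`, and `OnlineDetector`, the word-level shingle size `k`, and the default `threshold` are all assumptions chosen for illustration. The sketch compares each incoming document against every document seen so far using Jaccard similarity over word shingles, and reports the best match as the likely original if it clears the threshold. A production system would likely replace the exhaustive scan with hashed sketches or an index.

```python
from typing import Dict, Optional, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle holding all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


class OnlineDetector:
    """Hypothetical online detector: documents arrive one at a time."""

    def __init__(self, threshold: float = 0.5, k: int = 3) -> None:
        # Raising the threshold favours precision; lowering it favours recall.
        self.threshold = threshold
        self.k = k
        self._seen: Dict[str, Set[Shingle]] = {}

    def check(self, doc_id: str, text: str) -> Optional[str]:
        """Return the id of the most likely original, or None if not a near-duplicate."""
        current = shingles(text, self.k)
        best_id: Optional[str] = None
        best_score = 0.0
        # Exhaustive scan over everything seen so far (simple, not efficient).
        for seen_id, seen in self._seen.items():
            score = resemblance(current, seen)
            if score > best_score:
                best_id, best_score = seen_id, score
        # Remember this document so later arrivals can be compared against it.
        self._seen[doc_id] = current
        return best_id if best_score >= self.threshold else None
```

A document is checked as it arrives, for example `detector.check("b", text)`, and the return value is either the identifier of an earlier document or `None`. Adjusting `threshold` is one simple way to trade precision for recall, in the spirit of the tuning the abstract describes.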