Duplicate detection in adverse drug reaction surveillance

Data Mining and Knowledge Discovery

Abstract

The WHO Collaborating Centre for International Drug Monitoring in Uppsala, Sweden, maintains and analyses the world’s largest database of reports on suspected adverse drug reaction (ADR) incidents that occur after drugs are on the market. Duplicate case reports are an important data quality problem, and their detection remains a formidable challenge, especially in the WHO drug safety database, where reports are anonymised before submission. In this paper, we propose a duplicate detection method based on the hit-miss model for statistical record linkage described by Copas and Hilton, which copes well with the limited amount of training data and suits the available data (categorical and numerical rather than free text). We propose two extensions of the standard hit-miss model: a hit-miss mixture model for errors in numerical record fields and a new method to handle correlated record fields. We demonstrate the method’s effectiveness both at identifying the most likely duplicate for a given case report (94.7% accuracy) and at discriminating true duplicates from random matches (63% recall with 71% precision). The proposed method allows for more efficient data cleaning in post-marketing drug safety data sets, and perhaps in other knowledge discovery applications as well.
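To make the scoring idea concrete, the following is a minimal sketch of a basic hit-miss weight for a single categorical field, not the authors' implementation. It assumes that each record independently either copies the true value (a hit) or draws a random value from the field's frequency distribution (a miss, with probability `miss_prob`). It ignores blank values, numerical fields, and both extensions proposed in the paper. The function names, field names, frequencies, and miss probabilities are all hypothetical.

```python
import math


def hit_miss_weight(x, y, freq, miss_prob):
    """Log-likelihood ratio (duplicate vs. unrelated pair) for one field.

    Simplified hit-miss assumption (blanks ignored): in a duplicate pair,
    each record shows the shared true value unless it is a "miss", in
    which case it shows a random draw from `freq`.
    """
    # Probability that at least one of the two records is a miss.
    c = miss_prob * (2.0 - miss_prob)
    if x == y:
        p = freq[x]
        # Agreement on a rare value is stronger evidence than on a common one.
        return math.log(1.0 - c * (1.0 - p)) - math.log(p)
    # Disagreement: the same penalty regardless of the values involved.
    return math.log(c)


def pair_score(rec_a, rec_b, freqs, miss_probs):
    """Sum field weights, assuming fields are independent (an assumption)."""
    return sum(
        hit_miss_weight(rec_a[f], rec_b[f], freqs[f], miss_probs[f])
        for f in freqs
    )


# Hypothetical field statistics, for illustration only.
freqs = {
    "sex": {"F": 0.5, "M": 0.5},
    "country": {"SE": 0.01, "US": 0.99},
}
miss_probs = {"sex": 0.05, "country": 0.02}

a = {"sex": "F", "country": "SE"}
b = {"sex": "F", "country": "SE"}
c = {"sex": "M", "country": "US"}

print(pair_score(a, b, freqs, miss_probs))  # agreement on all fields
print(pair_score(a, c, freqs, miss_probs))  # disagreement on all fields
```

Under these assumptions, a higher total score means the pair is more likely a duplicate. Summing the weights treats the fields as independent, which is the simplification that the correlated-fields extension mentioned in the abstract is meant to address.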

References

  • Bate A, Lindquist M, Edwards IR, Olsson S, Orre R, Lansner A, De Freitas RM (1998) A Bayesian neural network method for adverse drug reaction signal generation. Eur J Clin Pharmacol 54:315–321

  • Belin T, Rubin D (1995) A method for calibrating false-match rates in record linkage. J Am Stat Assoc 90:694–707

  • Bilenko M, Mooney RJ (2003a) Adaptive duplicate detection using learnable string similarity measures. In: KDD ’03: proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, NY, USA, pp 39–48

  • Bilenko M, Mooney RJ (2003b) On evaluation and training-set construction for duplicate detection. In: Proceedings of the KDD-2003 workshop on data cleaning, record linkage and object consolidation, pp 7–12

  • Bortnichak EA, Wise RP, Salive ME, Tilson HH (2001) Proactive safety surveillance. Pharmacoepidemiol Drug Safety 10:191–196

  • Brinker AD, Beitz J (2002) Spontaneous reports of thrombocytopenia in association with quinine: clinical attributes and timing related to regulatory action. Am J Hematol 70:313–317

  • Copas J, Hilton F (1990) Record linkage: statistical models for matching computer records. J R Stat Soc Ser A 153(3):287–320

  • De Veaux RD, Hand DJ (2005) How to lie with bad data. Stat Sci 20(3):231–238

  • Edwards IR (1997) Adverse drug reactions: finding the needle in the haystack. Br Med J 315(7107):500

  • Edwards IR (1999) Spontaneous reporting – of what? Clinical concerns about drugs. Br J Clin Pharmacol 48(2):138–141

  • Edwards IR, Aronson JK (2000) Adverse drug reactions: definitions, diagnosis and management. Lancet 356(9237):1255–1259

  • Evans SJW (2000) Pharmacovigilance: a science or fielding emergencies? Stat Med 19(23):3199–3209

  • Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) The KDD process for extracting useful knowledge from volumes of data. Commun ACM 39(11):27–34

  • Fellegi IP, Sunter AB (1969) A theory for record linkage. J Am Stat Assoc 64:1183–1210

  • Hernández MA, Stolfo SJ (1998) Real-world data is dirty: data cleansing and the merge/purge problem. Data Min Knowl Discov 2(1):9–37

  • Jaro M (1989) Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc 84:414–420

  • Kim WY, Choi B-J, Hong EK, Kim S-K, Lee D (2003) A taxonomy of dirty data. Data Min Knowl Discov 7(1):81–99

  • Lindquist M (2004) Data quality management in pharmacovigilance. Drug Safety 27(12):857–870

  • Monge AE, Elkan C (1997) An efficient domain-independent algorithm for detecting approximately duplicate database records. Research issues on data mining and knowledge discovery, Tucson, AZ

  • Newcombe HB, Kennedy JM (1962) Record linkage: making maximum use of the discriminating power of identifying information. Commun ACM 5(11):563–566

  • Nkanza JN, Walop W (2004) Vaccine associated adverse event surveillance (VAAES) and quality assurance. Drug Safety 27:951–952

  • Norén GN, Bate A, Orre R, Edwards IR (2006) Extending the methods used to screen the WHO drug safety database towards analysis of complex associations and improved accuracy for rare events. Stat Med 25(21):3740–3757

  • Norén GN, Orre R, Bate A (2005) A hit-miss model for duplicate detection in the WHO drug safety database. In: KDD ’05: proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, NY, USA, pp 459–468

  • Orre R, Lansner A, Bate A, Lindquist M (2000) Bayesian neural networks with confidence estimations applied to data mining. Comput Stat Data Anal 34:473–493

  • Rawlins MD (1988) Spontaneous reporting of adverse drug reactions. II: Uses. Br J Clin Pharmacol 26(1):7–11

  • Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: KDD ’02: proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, NY, USA, pp 269–278

Author information

Corresponding author

Correspondence to G. Niklas Norén.

Additional information

Responsible editor: Hannu Toivonen.

About this article

Cite this article

Norén, G.N., Orre, R., Bate, A. et al. Duplicate detection in adverse drug reaction surveillance. Data Min Knowl Disc 14, 305–328 (2007). https://doi.org/10.1007/s10618-006-0052-8

  • DOI: https://doi.org/10.1007/s10618-006-0052-8
