TY - JOUR
T1 - Automatic identification of variables in epidemiological datasets using logic regression
AU - Lorenz, Matthias W.
AU - Abdi, Negin Ashtiani
AU - Scheckenbach, Frank
AU - Pflug, Anja
AU - Bülbül, Alpaslan
AU - Catapano, Alberico
AU - Agewall, Stefan
AU - Ezhov, Marat
AU - Bots, Michiel L.
AU - Kiechl, Stefan
AU - Orth, Andreas
AU - Norata, Giuseppe D.
AU - Empana, Jean Philippe
AU - Lin, Hung Ju
AU - McLachlan, Stela
AU - Bokemark, Lena
AU - Ronkainen, Kimmo
AU - Amato, Mauro
AU - Schminke, Ulf
AU - Srinivasan, Sathanur R.
AU - Lind, Lars
AU - Kato, Akihiko
AU - Dimitriadis, Chrystosomos
AU - Przewlocki, Tadeusz
AU - Okazaki, Shuhei
AU - Stehouwer, C. D.A.
AU - Lazarevic, Tatjana
AU - Willeit, Peter
AU - Yanez, David N.
AU - Steinmetz, Helmuth
AU - Sander, Dirk
AU - Poppert, Holger
AU - Desvarieux, Moise
AU - Ikram, M. Arfan
AU - Bevc, Sebastjan
AU - Staub, Daniel
AU - Sirtori, Cesare R.
AU - Iglseder, Bernhard
AU - Engström, Gunnar
AU - Tripepi, Giovanni
AU - Beloqui, Oscar
AU - Lee, Moo Sik
AU - Friera, Alfonsa
AU - Xie, Wuxiang
AU - Grigore, Liliana
AU - Plichart, Matthieu
AU - Su, Ta Chen
AU - Robertson, Christine
AU - Nijpels, Giel
AU - Dekker, Jaqueline M.
AU - PROG-IMT study group
PY - 2017/4/13
Y1 - 2017/4/13
N2 - Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows the creation of software that, for a target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of missing a correct source variable. Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated. Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables; in the validation sample, this was the case in 71% of all variables. Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
AB - Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows the creation of software that, for a target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of missing a correct source variable. Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated. Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables; in the validation sample, this was the case in 71% of all variables. Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
KW - Data management
KW - Epidemiology
KW - Logic regression
KW - Meta-analysis
UR - http://www.scopus.com/inward/record.url?scp=85018523489&partnerID=8YFLogxK
U2 - 10.1186/s12911-017-0429-1
DO - 10.1186/s12911-017-0429-1
M3 - Article
C2 - 28407816
AN - SCOPUS:85018523489
VL - 17
JO - BMC Medical Informatics and Decision Making
JF - BMC Medical Informatics and Decision Making
SN - 1472-6947
IS - 1
M1 - 40
ER -