Machine Learning in Clinical Journals
See Article by Stevens et al
Although machine learning (ML) algorithms have grown more prevalent in clinical journals, inconsistent reporting of methods has led to skepticism about ML results and has blunted their adoption into clinical practice. A common problem authors face when reporting on ML methods is the lack of a single reporting guideline that applies to the panoply of ML problem types and approaches. Responding to this concern, Stevens et al1 in this issue of Circulation: Cardiovascular Quality and Outcomes propose a set of reporting recommendations for ML papers. This is meant to augment existing (but more general) reporting guidelines on predictive models, such as the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement2—recognizing that international efforts are underway to address these concerns more fully, including TRIPOD-ML and Consolidated Standards of Reporting Trials–Artificial Intelligence (CONSORT-AI).3,4 However, a more fundamental question remains: given their proposed scope, will such guidelines address all of the different types of ML methods increasingly used in the published literature (such as unsupervised learning)?
This is no small problem in a rapidly changing field. For instance, although Stevens et al focus on the spectrum of ML problem types from unsupervised to supervised learning, more recent literature has also seen the application of reinforcement learning methods to clinical problems.5,6 Additionally, the phrase “machine learning” can be used to refer to a number of different algorithms, including decision trees, random forests, gradient-boosted machines, support vector machines, and neural networks. Perhaps somewhat confusingly, the phrase is also commonly applied to methods, such as linear regression and Bayesian models, with which readers of clinical journals are more familiar. Thus, it is not uncommon to read an article that purports to use “machine learning” in its title, only to find that the actual algorithm is a penalized regression model, which a statistical reviewer may consider a traditional statistics model.
As with many types of clinical research, we think standardized reporting of ML methods is a needed advancement and will be helpful for editors and reviewers evaluating the quality of manuscripts. Ultimately, this will be of great benefit to readers of published articles. In this editorial, however, we describe ongoing difficulties in determining what is and what is not an ML manuscript to which Stevens et al’s reporting recommendations would apply. We then discuss how this determination affects how an article is perceived and judged by journals and readers. On the one hand, strong model performance described in articles may be the result of overfitting, which better reporting practices would help uncover. On the other hand, valid ML findings may be overlooked by even experienced statistical editors because of a lack of familiarity with specific methods. Although the Stevens et al recommendations provide an important first step toward improving ML articles, we propose that clinical journals may also benefit from creating ML editorship roles (as Circulation: Cardiovascular Quality and Outcomes is doing) to work alongside statistical editors.
What’s in a Name?
The wide range of ML problem types and approaches highlights a deeper and much more difficult-to-solve taxonomy challenge: what exactly is meant by the phrase “machine learning” in clinical journals. Furthermore, the ML community is heterogeneous, comprising researchers with backgrounds spanning computer science, informatics, mathematics, operations engineering, and statistics. Given the interdisciplinary nature of ML, it can be difficult to understand how a proposed method may relate to traditional statistical models commonly found in the clinical literature. This complicates the review process because it is difficult to evaluate whether a proposed method constitutes a genuinely new advance. Stevens et al try to draw contrasts between ML methods and traditional statistical methods, which they sometimes refer to as hypothesis-driven approaches. This contrast is intended to help readers of clinical studies—who are often more familiar with traditional statistical methods—determine whether a given article uses ML methods and is thus subject to their reporting recommendations.
However, reasonable scientists may disagree (and often do!) on which approaches fall under the umbrella terms of “ML” and statistics, and much of this disagreement stems from the complicated history surrounding the discovery and use of these methods.7–9 For instance, lasso regression was popularized by a statistician but is used more commonly by ML practitioners (and is sometimes referred to as L1-regularization).10 Similarly, the random forest algorithm was first proposed by a computer scientist but popularized (and trademarked) by 2 statisticians, yet it is not considered a traditional statistics method. Even the use of hypothesis testing does not reliably determine that an algorithm belongs to traditional statistics: conditional inference trees use a hypothesis test to determine optimal splits but are considered ML because of their resemblance to random forests.11
For this reason, we caution against using the phrase “machine learning” in isolation in the title of an article in a clinical journal. Instead, we suggest naming the specific algorithm used in the title (eg, K-means clustering or penalized regression) or omitting the method from the title in favor of a more general term (eg, clustering method or prediction model). If ambiguity about the problem type remains after naming the algorithm (which is exceedingly rare), then we suggest labeling it as a supervised or unsupervised learning task. This precision will put readers in a better position to judge articles on the appropriate application of the algorithm rather than on whether the article is, in fact, ML.
Stevens et al also suggest that authors provide a rationale for the use of ML methods in place of traditional statistics methods. As with any method, we see value in describing for readers the reasoning behind the choice of a specific tool. Yet even a statement of rationale could provoke seemingly partisan divides between ML and statistics reviewers because the best approach may depend on the signal-to-noise ratio of the problem and may not be knowable in advance. On the one hand, an independent evaluation of hundreds of algorithms on over a hundred publicly available datasets found ML algorithms to perform substantially better than logistic regression.12 On the other hand, 2 systematic reviews of the medical literature found no benefit of ML methods over logistic regression.13,14 These differences could result from the lower signal-to-noise ratio in biomedical data versus other types of data, publication bias, or model misspecification.
Better Reporting May Uncover Poor Modeling Practices
There are many situations, particularly with high-dimensional temporal (eg, clinical time series) or spatial data (eg, imaging), where ML methods outperform traditional statistical methods. Unfortunately, many common errors in the application of supervised ML methods lead to overestimation of model performance. For example, imputing data or applying data-driven feature selection methods before splitting the training and test sets can lead to overestimates of model performance. This occurs relatively commonly and has been referred to as the winner’s curse or testimation bias.15 Similarly, using cross-validation to select hyperparameters without nesting can lead to overestimates of internal validity. Finally, use of billing codes as predictors in electronic health record data may lead to favorably biased results because of delayed data entry: data provided to the model retrospectively may not have been available had the model been run prospectively. Because clinical journals may be more likely to consider articles with strong model performance (eg, high area under the curve), peer reviewers may be disproportionately asked to review articles with unrecognized overfitting.
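To make the first 2 pitfalls concrete, the sketch below shows one way to avoid them. It is our illustration, not a method described by Stevens et al; it assumes the scikit-learn library, and the synthetic dataset and parameter grid are arbitrary placeholders. Imputation and data-driven feature selection are kept inside a pipeline so they are fit only on training folds, and hyperparameter tuning is nested inside an outer cross-validation loop.

```python
# Minimal sketch (scikit-learn, assumed): preprocessing lives inside a Pipeline
# so it is refit on each training fold, and hyperparameter tuning is nested
# inside an outer cross-validation loop to avoid optimistic performance estimates.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Placeholder data; real clinical data would also contain missing values.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fit on training folds only
    ("select", SelectKBest(f_classif)),             # data-driven selection, inside CV
    ("model", LogisticRegression(max_iter=1000)),
])

param_grid = {"select__k": [10, 20], "model__C": [0.1, 1.0, 10.0]}

# Inner loop tunes hyperparameters; outer loop estimates performance.
inner = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
nested_auc = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(f"Nested CV AUC: {nested_auc.mean():.3f} (SD {nested_auc.std():.3f})")
```

Because the outer loop evaluates the full modeling procedure, including imputation, selection, and tuning, the reported discrimination is not inflated by information leaking from the test folds.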
Uncovering concerns like overfitting requires authors to be clear in their methods. Stevens et al identify a number of important considerations relevant to ML articles that deserve to be highlighted during reporting, including data quality, feature selection, model optimization (or tuning), and code sharing. Better reporting of these aspects will improve the quality of clinical ML articles by enabling editors, reviewers, and readers to scrutinize the work appropriately. However, these safeguards may not be sufficient for all concerns. For example, none of them would prevent the widespread use of models with substantial yet often unrecognized problems such as racial bias.16,17
While problems like racial bias are more apparent when the models are transparent (such as linear regression models or decision trees), the use of more opaque methods can obscure them. Interpreting the learned relationships between predictors and the outcome can help determine when a model may be learning an unintended representation. Although this is commonly done using permutation-based methods or class activation maps,18,19 even these methods may not provide an accurate picture of what the model has learned.20,21 Thus, use of model interpretability methods may instill a false sense of confidence in readers and should be reported with caution.
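As a concrete illustration of a permutation-based approach, the sketch below is our example rather than a method from the cited work; it assumes scikit-learn, and the random forest, synthetic data, and scoring choice are placeholders. Each predictor is shuffled in turn on a held-out set and the resulting drop in discrimination is recorded.

```python
# Minimal sketch (scikit-learn, assumed): permutation importance on a held-out set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each predictor and measure the drop in test-set AUC. Note that
# correlated predictors can share (and therefore mask) importance.
result = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                n_repeats=20, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i}: mean drop {result.importances_mean[i]:.3f}")
```

As noted above, such output can mislead when predictors are correlated or when the model has learned an unintended representation, so it should be reported with the same caution.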
Sharing model objects (files that contain a representation of the trained model itself) can also facilitate independent evaluations of ML models; this practice is common in the ML literature but rarely done in clinical journals.22–24 For example, neural network weights may be shared in hdf5, pt, or mojo file formats for models trained using the keras, PyTorch, and h2o packages, respectively. Another common way to save and share ML models is as native R (rds) or Python (pickle) binary files. Standard formats for sharing ML models do exist, such as the Predictive Model Markup Language, but limited integration with ML software packages has prevented their widespread use.25,26 Few of these technical details are familiar to clinical researchers, which may partly explain why model objects are so rarely shared.
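To show what sharing a model object can look like in practice, the sketch below is a minimal Python example of our own; the file name and the scikit-learn model are placeholders. A trained model is written to a pickle file and reloaded for independent evaluation; the framework-specific formats named above (hdf5, pt, mojo) serve the same purpose for neural network and h2o models.

```python
# Minimal sketch: save and reload a trained model object so others can evaluate it.
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Python's native binary format (analogous to an .rds file in R).
with open("model.pickle", "wb") as f:
    pickle.dump(model, f)

with open("model.pickle", "rb") as f:
    reloaded = pickle.load(f)

# The reloaded object reproduces the original model's predictions.
assert (reloaded.predict(X) == model.predict(X)).all()

# For neural networks, frameworks provide their own formats, eg, assuming a
# Keras model named `net`: net.save("net.h5")  # HDF5 architecture + weights
```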
A Need for ML Editors at Clinical Journals
Statistical editors play a key role in ensuring that scientific research is conducted and reported in a sound manner. Faced with a manuscript using ML approaches, however, statistical editors may not always be in the best position to judge its scientific rigor. This is perhaps most true for research utilizing neural networks with complex architectures but applies to other algorithms as well. For instance, while there is helpful statistical guidance on determining appropriate sample sizes for prediction models,27 it is unclear how this guidance could be applied to models built from radiology images consisting of millions of pixels. If each pixel were considered a predictor variable, almost every such article would likely be flagged by a statistical editor as having an inadequate sample size. Now consider that an ML researcher may not fit an entire neural network to the radiology images at all, deciding instead to reuse a model from a different domain and to refit only a subset of its parameters while freezing the rest of the model (an approach known as transfer learning). Transfer learning can enable neural networks to be fit effectively on much smaller datasets by transferring knowledge learned from larger ones. Neural networks with a large number of parameters can be surprisingly robust on small datasets, but this depends on the choice of model architecture, activation functions, hyperparameters (such as learning rate), and checks to ensure model convergence.28
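For readers unfamiliar with this workflow, the sketch below is our illustration, assuming PyTorch and torchvision; the choice of pretrained network, class count, and optimizer settings are placeholders, and the dataset and training loop are omitted. It freezes the parameters of a network pretrained on natural images and refits only a new final layer, which is the essence of the transfer-learning approach described above.

```python
# Minimal sketch (PyTorch/torchvision, assumed): reuse a pretrained network,
# freeze its weights, and refit only a new final layer on a smaller dataset.
import torch
import torch.nn as nn
from torchvision import models

net = models.resnet18(pretrained=True)   # weights learned on natural images

# Freeze all pretrained parameters so they are not updated during training.
for param in net.parameters():
    param.requires_grad = False

# Replace the classification head; only these new parameters will be trained.
num_classes = 2                          # illustrative: eg, disease present vs absent
net.fc = nn.Linear(net.fc.in_features, num_classes)

optimizer = torch.optim.Adam(net.fc.parameters(), lr=1e-3)
# ...standard training loop over the (much smaller) labeled image dataset...
```

In this setup, the effective number of parameters being estimated is only that of the new final layer, which is one reason conventional sample-size intuitions can mislead when applied to such models.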
These kinds of nuances may be identified through peer review, but clinical journals that increasingly deal with ML manuscripts may benefit from a more consistent approach to evaluating them. Stevens et al provide one piece of this puzzle through their reporting recommendations, but this approach needs to be augmented. The scientific quality of manuscripts reporting on ML models would be greatly improved by the creation of ML editorships, as some journals have already begun to do.29 Like statistical editors focused on other areas, ML editors would be helpful in identifying peer reviewers with specific expertise in the ML algorithm being applied in a given article and in ensuring that the description of the methods and results is accurate and appropriate for a clinical readership.
We have taken this approach at Circulation: Cardiovascular Quality and Outcomes, where we have identified a new editor for handling ML manuscripts—cardiologist Dr Rashmee Shah from the University of Utah—and a new member of the statistical editorial team—computer scientist Dr Bobak Mortazavi from Texas A&M University. We think these 2 additions to our team, combined with the reporting tools recommended by Stevens et al, will create a more rigorous and useful process for internally evaluating ML methods in studies submitted to us. In the end, these changes will help our readers better understand these exciting studies and apply their findings to our patients.
Disclosures
The disclosures provided by Dr Nallamothu in compliance with American Heart Association’s annual Journal Editor Disclosure Questionnaire are available at https://www.ahajournals.org/pb-assets/COI_09-2019-1568653792133.pdf. The other authors report no conflicts.
References
1. Stevens LM, Mortazavi BJ, Deo RC, Curtis L, Kao DK. Recommendations for reporting machine learning analyses in clinical research. Circ Cardiovasc Qual Outcomes. 2020;13:e006556. doi: 10.1161/CIRCOUTCOMES.120.006556
2. Collins GS, Reitsma JB, Altman DG, Moons KG; TRIPOD Group. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Circulation. 2015;131:211–219. doi: 10.1161/CIRCULATIONAHA.114.014508
3. Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet. 2019;393:1577–1579. doi: 10.1016/S0140-6736(19)30037-6
4. Liu X, Faes L, Calvert MJ, Denniston AK; CONSORT/SPIRIT-AI Extension Group. Extension of the CONSORT and SPIRIT statements. Lancet. 2019;394:1225. doi: 10.1016/S0140-6736(19)31819-7
5. Komorowski M, Celi LA, Badawi O, Gordon AC, Faisal AA. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nat Med. 2018;24:1716–1720. doi: 10.1038/s41591-018-0213-5
6. Gottesman O, Johansson F, Komorowski M, Faisal A, Sontag D, Doshi-Velez F, Celi LA. Guidelines for reinforcement learning in healthcare. Nat Med. 2019;25:16–18. doi: 10.1038/s41591-018-0310-5
7. Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci. 2001;16:199–231.
8. Finlayson S. Comments on ML “versus” Statistics. 2020. https://sgfin.github.io/2020/01/31/Comments-ML-Statistics/. Accessed June 2, 2020.
9. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018;319:1317–1318. doi: 10.1001/jama.2017.18391
10. Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. J R Stat Soc Series B (Stat Methodol). 2011;73:273–282.
11. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15:651–674.
12. Fernández-Delgado M, Cernadas E, Barro S, Amorim D. Do we need hundreds of classifiers to solve real world classification problems? J Mach Learn Res. 2014;15:3133–3181.
13. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22. doi: 10.1016/j.jclinepi.2019.02.004
14. Mahmoudi E, Kamdar N, Kim N, Gonzales G, Singh K, Waljee AK. Use of electronic medical records in development and validation of risk prediction models of hospital readmission: systematic review. BMJ. 2020;369:m958. doi: 10.1136/bmj.m958
15. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Cham: Springer; 2019.
16. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366:447–453. doi: 10.1126/science.aax2342
17. Grother P, Ngan M, Hanaoka K. Face Recognition Vendor Test (FRVT): Part 3, Demographic Effects. 2019. https://nvlpubs.nist.gov/nistpubs/ir/2019/NIST.IR.8280.pdf
18. Molnar C. Interpretable Machine Learning. 2020. https://christophm.github.io/interpretable-ml-book/index.html. Accessed June 3, 2020.
19. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV; 2016:2921–2929. doi: 10.1109/CVPR.2016.319
20. Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A. Conditional variable importance for random forests. BMC Bioinformatics. 2008;9:307. doi: 10.1186/1471-2105-9-307
21. Adebayo J, Gilmer J, Muelly M, Goodfellow I, Hardt M, Kim B. Sanity checks for saliency maps. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in Neural Information Processing Systems 31. Curran Associates, Inc; 2018:9505–9515. http://papers.nips.cc/paper/8160-sanity-checks-for-saliency-maps.pdf. Accessed Sept 25, 2020.
22. Open Neural Network Exchange. ONNX. http://onnx.ai. Accessed June 3, 2020.
23. Auffenberg GB, Ghani KR, Ramani S, Usoro E, Denton B, Rogers C, Stockton B, Miller DC, Singh K; Michigan Urological Surgery Improvement Collaborative. askMUSIC: leveraging a clinical registry to develop a new machine learning model to inform patients of prostate cancer treatments chosen by similar men. Eur Urol. 2019;75:901–907. doi: 10.1016/j.eururo.2018.09.050
24. Wong A, Young AT, Liang AS, Gonzales R, Douglas VC, Hadley D. Development and validation of an electronic health record-based machine learning model to estimate delirium risk in newly hospitalized patients without known cognitive impairment. JAMA Netw Open. 2018;1:e181018. doi: 10.1001/jamanetworkopen.2018.1018
25. Grossman R, Bailey S, Ramu A, Malhi B, Hallstrom P, Pulleyn I, Qin X. The management and mining of multiple predictive models using the predictive modeling markup language. Inform Softw Tech. 1999;41:589–595.
26. PMML 4.4 – General Structure. http://dmg.org/pmml/v4-4/GeneralStructure.html. Accessed June 9, 2020.
27. Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, Moons KGM, Collins G, van Smeden M. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441. doi: 10.1136/bmj.m441
28. Beam AL. You can probably use deep learning even if your data isn’t that big. 2017. http://beamlab.org/deeplearning/2017/06/04/deep_learning_works.html. Accessed June 4, 2020.
29. Waljee appointed associate editor for the journal Gut. Institute for Healthcare Policy & Innovation; 2020. https://ihpi.umich.edu/news/waljee-appointed-associate-editor-journal-gut. Accessed June 3, 2020.