Determining Race from Chest X-Rays

We used deep learning to guess a patient’s race based on their chest X-ray.
Authors

Trong Le, Jay-U Chung, Kent Canonigo

Published

May 10, 2023

The GitHub repository for the project code can be found here.

1. Abstract

A series of prior results, including those by Gichoya et al. (2022), have shown that it is possible to use deep convolutional neural networks to predict a patient's self-reported race from chest radiographs with high accuracy. Since this result raises significant ethical concerns for medical imaging algorithms, we aim to reproduce it and investigate the implications such an algorithm could have for race-based medicine and the racial inequalities reinforced by algorithms. We use a subset of the chest radiographs from the CheXpert dataset, aiming to classify images into Black, White, and Asian patients. We primarily train and test on a subset of data with equal proportions across all races. In particular, we compare the results of pretrained and untrained ResNet18 models and the EfficientNetB0 model. Our results achieve around 70% accuracy, with some racial bias and minimal gender bias. We therefore conclude that, on a smaller scale, we have confirmed that it is indeed possible to train neural networks to accurately classify race from chest radiographs.

2. Introduction

Deep neural networks are increasingly popular as diagnostic tools in medicine. While they at times surpass the accuracy of experts, results such as those by Seyyed-Kalantari et al. (2021) show concerning rates of underdiagnosis for patients who are Black, Hispanic, younger, or from lower socioeconomic status groups. Problematically, this reinforces a history of minority or economically vulnerable groups receiving inadequate medical care, especially when many publicly available datasets disproportionately represent White patients (NEEDS CITATION).

As Seyyed-Kalantari et al. (2021) suggest, this may be a matter of confounding variables such as bias amplification or differing disease prevalence. However, a paper by Gichoya et al. (2022) investigated a direct question: can race be inferred from chest X-rays? Clinically speaking, this is not expected; it is an implicit assumption that chest radiographs contain no information about one's demographic characteristics beyond those most relevant to physiology, such as age or biological sex. Many models specifically exclude such characteristics so that classification is based solely on the image. However, deep neural networks are often a black box, capable of picking up on surprising pixel-level patterns.

Indeed, Gichoya et al. (2022) found that, using the self-reported race labels Black, White, and Asian, it is possible to classify chest radiographs into these three categories with high accuracy (AUCs of 0.91–0.99). To the extent that they investigated, this was not based on potentially race-related characteristics such as bone or breast density or disease prevalence. Even highly degraded versions of the images maintained high performance. Moreover, this pattern could not be replicated with algorithms that did not use the image data: a “logistic regression model (AUC 0·65), a random forest classifier (0·64), and an XGBoost model (0·64) to classify race on the basis of age, sex, gender, disease, and body habitus performed much worse than the race classifiers trained on imaging data”. As they conclude, “medical AI systems can easily learn to recognise self-reported racial identity from medical images, and that this capability is extremely difficult to isolate”: the problem may be prevalent in a large range of algorithms and would be difficult to correct for. Moreover, the fact that they obtained these results by training on a variety of popular and publicly available medical imaging datasets, including MIMIC-CXR, CheXpert, the National Lung Cancer Screening Trial, RSNA Pulmonary Embolism CT, and the Digital Hand Atlas, further suggests that this could be broadly applicable to other AI projects.

This paper is also not a standalone result. A prior paper by Yi et al. (2021) demonstrated that age and sex can be determined from chest radiographs for Chinese and American populations. A paper by Adleberg et al. (2022), training on the MIMIC-CXR dataset, created deep learning models that can extract self-reported information such as age, gender, race, and ethnicity with high accuracy, and even insurance status at moderate accuracy.

While the question of whether their results are reproducible has been more adequately answered elsewhere, we are interested in whether it is possible to reproduce their results on a smaller scale. Moreover, we aim to examine the ethical implications of their work beyond the problems of bias it poses for deep neural networks. Gichoya et al. (2022) “emphasise that the ability of AI to predict racial identity is itself not the issue of importance”, but is this enough? It does not seem adequate to stop at this conclusion when racial classification itself is a goal long rooted in the painful histories of eugenics, slavery, and colonization. To this end, we discuss the definition of race and its use in medicine below.

Race in Medicine

The most concerning question we face is what the implications of this model are. The direct uses for this algorithm are limited. However, its main value is as a demonstration: anyone using an AI algorithm may unknowingly be using procedures similar to this one to classify self-reported race and use it as a proxy in other classification tasks.

We cannot ignore that there still may be potential users of this algorithm. The very goal of racial classification contains an implicit assumption that race exists. However, we must address two central questions: what race represents in medicine, and how race has been used in clinical practice.

Does Race Exist?

Whether race exists as a biological phenomenon, and not as a social construct, is a hotly debated issue. As Cerdeña, Plaisime, and Tsai (2020) note, “race was developed as a tool to divide and control populations worldwide. Race is thus a social and power construct with meanings that have shifted over time to suit political goals, including to assert biological inferiority of dark-skinned populations.”

One justification for the biological reality of races is based on the assumption that different races have distinct genetics from one another, and can be fit into genetic groups. However, Maglo, Mersha, and Martin (2016) note that humans are not distinct by evolutionary criteria and genetic similarities between “human races, understood as continental clusters, have no taxonomic meaning”, with there being “tremendous diversity within groups” [2]. Whether race defines a genetic profile is therefore unclear at best, with correlations between race and disease being confounded by factors such as the association between race and socioeconomic status (NEEDS CITATION).

What is Race-based Medicine?

It is possible that some may be interested in using this algorithm to deduce the race of an individual and use this as part of medical decisions. There are some correlations between disease prevalence and race. Maglo, Mersha, and Martin (2016) note that “Recent studies showed that ancestry mapping has been successfully applied for disease in which prevalence is significantly different between the ancestral populations to identify genomic regions harboring diseases susceptibility loci for cardiovascular disease (Tang et al. (2005)), multiple sclerosis (Reich et al. (2005)), prostate cancer (Freedman et al. (2006)), obesity (Cheng et al. (2009)), and asthma (Vergara et al. (2009))” [2].

These practices would be characteristic of race-based medicine. As Cerdeña, Plaisime, and Tsai (2020) argue, this is “the system by which research characterizing race as an essential, biological variable, [which] translates into clinical practice, leading to inequitable care” [1]. Notably, then, race-based medicine has come under heavy criticism.

The Harms of Race-based Medicine

As stated above, race is not an accurate proxy for genetics. Cerdeña, Plaisime, and Tsai (2020) note that in medical practice, race is used as an inaccurate guideline for care: “Black patients are presumed to have greater muscle mass …On the basis of the understanding that Asian patients have higher visceral body fat than do people of other races, they are considered to be at risk for diabetes at lower body-mass indices” [1]. As they note, race-based medicine can be founded more on racial stereotypes and generalizations than on sound evidence.

Moreover, race-based medicine can lead to ineffective treatments. Apeles (2022) summarizes a study on race-based prescribing for Black patients with high blood pressure, which found that these race-specific recommendations showed no benefit: “Practice guidelines have long recommended that Black patients with high blood pressure and no comorbidities be treated initially with a thiazide diuretic or a calcium channel blocker (CCB) instead of an angiotensin converting enzyme inhibitor (ACEI) and/or angiotensin receptor blocker (ARB). By contrast, non-Black patients can be prescribed any of those medicines regardless of comorbidities.” In addition, the authors of the study found that “other factors may be more important than considerations of race, such as dose, the addition of second or third drugs, medication adherence, and dietary and lifestyle interventions. Follow-up care was important, and the Black patients who had more frequent clinical encounters tended to have better control of their blood pressure.”

In addition, Vyas, Eisenstein, and Jones (2020) argue that race is ill-suited as a correction factor for medical algorithms. As they found, algorithms such as the American Heart Association (AHA) Get with the Guidelines–Heart Failure Risk Score, which predicts the likelihood of death from heart failure, the Vaginal Birth after Cesarean (VBAC) calculator, which predicts the likelihood of a successful vaginal birth for someone with a previous cesarean section, and the STONE score, which predicts the likelihood of kidney stones in patients with flank pain, all use race to adjust their predictions. However, they find that these algorithms were not sufficiently evidence-based, as “Some algorithm developers offer no explanation of why racial or ethnic differences might exist. Others offer rationales, but when these are traced to their origins, they lead to outdated, suspect racial science or to biased data”. Using race in these scores can then discourage racial minorities from receiving proper treatment, exacerbating existing problems of unequal health outcomes.

Conclusion

It is clear, then, that anyone who intends to use race for diagnosis could harm racial minority groups. Race is inherently a complex social and economic phenomenon and cannot be treated as a clear biological variable. Hence anyone intending to use or create such algorithms runs the risk of creating dangerous biases in treatment, ones that could worsen existing disparities in care for vulnerable populations.

3. Values Statement

The potential users of this project are scholars and researchers who remain adamant in exploring the classification of race through the intersection of other socially constructed identities (gender, ethnicity, sexuality, etc.). A large body of literature identifies race as a proxy for categorizing and describing certain social, cultural, and biological characteristics of individuals or groups; race has also played a pervasive historical role in the medical field. Those who are harmed and still affected by this project are the hidden bodies: the groups of individuals historically marginalized in society, whose very identities are in a constant battle for validity. In pursuing this project, we acknowledge that the technology and results could further harm and perpetuate the racist ideologies that currently exist in asserting physiological differences across racial groups.

4. Materials and Methods

Our Data

We used the CheXpert dataset collected by Irvin et al. (2019). The dataset contains images collected between October 2002 and July 2017, and was finalized after analyzing all chest radiographs from Stanford Hospital. It contains 224,316 frontal and lateral chest radiographs of 65,240 patients. Each radiograph is labeled with information such as age, gender, race, ethnicity, and medical conditions, but we are primarily concerned with race and gender. A full structured datasheet for the CheXpert dataset is provided by Garbin et al. (2021).

One limitation of the dataset is the limited variety of X-ray devices used to capture the images, since the data come from a single institution, Stanford Hospital. Thus, models trained on this dataset can only be said to be valid for patients living around the Stanford area and for scans coming from this hospital. It is always possible that our model is specializing to features specific to these Stanford images, so it may perform worse when evaluated on scans from other institutions.

Note that our actual data come from Kaggle, as the original 11 GB downsampled version of the dataset was unavailable.

The first of the two relevant dataframes is df_patients, which contains a path to each patient's image, their Sex (male or female), Age, and Frontal/Lateral (indicating whether the scan is taken from the front or the side). The remaining columns are disease-related and were not relevant to our analysis.

import pandas as pd

# Image-level metadata: one row per radiograph, with the path, demographics, and disease labels.
df_patients = pd.read_csv('../data/train.csv')
df_patients
Path Sex Age Frontal/Lateral AP/PA No Finding Enlarged Cardiomediastinum Cardiomegaly Lung Opacity Lung Lesion Edema Consolidation Pneumonia Atelectasis Pneumothorax Pleural Effusion Pleural Other Fracture Support Devices
0 CheXpert-v1.0-small/train/patient00001/study1/... Female 68 Frontal AP 1.0 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 1.0
1 CheXpert-v1.0-small/train/patient00002/study2/... Female 87 Frontal AP NaN NaN -1.0 1.0 NaN -1.0 -1.0 NaN -1.0 NaN -1.0 NaN 1.0 NaN
2 CheXpert-v1.0-small/train/patient00002/study1/... Female 83 Frontal AP NaN NaN NaN 1.0 NaN NaN -1.0 NaN NaN NaN NaN NaN 1.0 NaN
3 CheXpert-v1.0-small/train/patient00002/study1/... Female 83 Lateral NaN NaN NaN NaN 1.0 NaN NaN -1.0 NaN NaN NaN NaN NaN 1.0 NaN
4 CheXpert-v1.0-small/train/patient00003/study1/... Male 41 Frontal AP NaN NaN NaN NaN NaN 1.0 NaN NaN NaN 0.0 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
223409 CheXpert-v1.0-small/train/patient64537/study2/... Male 59 Frontal AP NaN NaN NaN -1.0 NaN NaN NaN NaN -1.0 0.0 1.0 NaN NaN NaN
223410 CheXpert-v1.0-small/train/patient64537/study1/... Male 59 Frontal AP NaN NaN NaN -1.0 NaN NaN NaN 0.0 -1.0 NaN -1.0 NaN NaN NaN
223411 CheXpert-v1.0-small/train/patient64538/study1/... Female 0 Frontal AP NaN NaN NaN NaN NaN -1.0 NaN NaN NaN NaN NaN NaN NaN NaN
223412 CheXpert-v1.0-small/train/patient64539/study1/... Female 0 Frontal AP NaN NaN 1.0 1.0 NaN NaN NaN -1.0 1.0 0.0 NaN NaN NaN 0.0
223413 CheXpert-v1.0-small/train/patient64540/study1/... Female 0 Frontal AP 1.0 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN

223414 rows × 19 columns

The second, df_race, links each patient ID to their gender, age, primary race, and ethnicity.

# Patient-level demographics, including self-reported race and ethnicity.
df_race = pd.read_excel('../data/chexpert_race.xlsx')
df_race
PATIENT GENDER AGE_AT_CXR PRIMARY_RACE ETHNICITY
0 patient24428 Male 61 White Non-Hispanic/Non-Latino
1 patient48289 Female 39 Other Hispanic/Latino
2 patient33856 Female 81 White Non-Hispanic/Non-Latino
3 patient41673 Female 42 Unknown Unknown
4 patient48493 Male 71 White Non-Hispanic/Non-Latino
... ... ... ... ... ...
65396 patient65702 Male 1 Other Hispanic/Latino
65397 patient04979 Female 27 Other Hispanic/Latino
65398 patient11445 Female 29 Unknown Unknown
65399 patient23235 Female 41 Other, Hispanic Hispanic/Latino
65400 patient05143 Male 24 White Non-Hispanic/Non-Latino

65401 rows × 5 columns

For our purposes, we combined the two dataframes to match the image paths with the patient's race. This was achieved by extracting the patient ID from each image path and merging on it.
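The snippet below is a minimal sketch of this merge, assuming the column names shown above (Path in df_patients and PATIENT in df_race); the exact code in our repository may differ.

# The patient ID (e.g. "patient00001") is the third component of each image path.
df_patients['PATIENT'] = df_patients['Path'].str.split('/').str[2]

# Attach each patient's self-reported race to their images.
df_merged = df_patients.merge(df_race[['PATIENT', 'PRIMARY_RACE']], on='PATIENT', how='inner')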

df_race['PRIMARY_RACE'].unique()
array(['White', 'Other', 'Unknown', 'White, non-Hispanic', 'Asian', nan,
       'Black or African American', 'Black, non-Hispanic',
       'Other, Hispanic', 'Race and Ethnicity Unknown',
       'Asian, non-Hispanic', 'Pacific Islander, non-Hispanic',
       'Native Hawaiian or Other Pacific Islander', 'Other, non-Hispanic',
       'Patient Refused', 'White, Hispanic', 'Black, Hispanic',
       'Asian, Hispanic', 'American Indian or Alaska Native',
       'Native American, Hispanic', 'Native American, non-Hispanic',
       'Pacific Islander, Hispanic', 'Asian - Historical Conv',
       'White or Caucasian'], dtype=object)

There are quite a few self-reported race labels, but we focused only on White, Black, and Asian. As in Gichoya et al.'s code, we decided to label any race containing White as White (so this includes White, non-Hispanic and White, Hispanic), and similarly for Black and Asian.
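A rough sketch of this relabeling rule, reusing the df_merged dataframe from the earlier merge sketch (names and details are illustrative):

# Collapse any label containing "White", "Black", or "Asian" to that group;
# all other labels (Unknown, Other, Pacific Islander, ...) are dropped.
def simplify_race(label):
    if pd.isna(label):
        return None
    for group in ['White', 'Black', 'Asian']:
        if group.lower() in str(label).lower():
            return group
    return None

df_merged['RACE'] = df_merged['PRIMARY_RACE'].apply(simplify_race)
df_merged = df_merged.dropna(subset=['RACE'])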

Whether this approach is valid is somewhat questionable, but perhaps the actual ethnographic definitions of race are similarly ill-defined.

After combining our dataframes and removing any patients not identifying as White, Black, or Asian (including those with non-reported or NaN values), we produce the following figures. As seen below, images from White patients make up the vast majority of this data, and we are concerned that this may lead to racial bias in the model's classifications.

[Figure: distribution of images by race, with White patients making up the vast majority]

Interestingly, more images also belong to male patients than to female patients, which could lead to gender bias in our algorithm.

[Figures: distribution of images by gender]

Our Method

Data Subsetting

We trained our model using 10,000 frontal chest X-rays, such as the one in the following figure. The only input features are from the image itself, and the only target is race.

We created the train and test datasets by taking only the Black, White, and Asian populations. The full dataset contains 90,000 images, with 72,000 in the train set and 18,000 in the test set. We created the equal-proportion training set by taking the first 6,000 images from each race group in the training set (this includes almost all the Black patients, the smallest group) and randomizing the order. A similar process was used for the test set, which totals only 3,000 images.
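Below is a sketch of this subsetting under the same assumptions as the earlier snippets (a combined dataframe with a simplified RACE column); the exact counts and random seed in our notebooks may differ.

N_PER_RACE = 6000  # 6,000 training images per group; the equal test split totals 3,000

balanced_train = (
    pd.concat([df_merged[df_merged['RACE'] == r].head(N_PER_RACE)
               for r in ['White', 'Black', 'Asian']])
      .sample(frac=1, random_state=0)   # randomize the order of the combined subset
      .reset_index(drop=True)
)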

Aside from potential bias, subsetting the data makes it easier to train and evaluate an algorithm. Especially since we are not using a very large amount of data, a large model can easily overfit to the demographics of the patients. Attempts to use the unbalanced training set essentially resulted in an algorithm that would always guess White. With a balanced set, it is easier for the model to learn the relevant features without this frequency bias.

We also created training sets that included only the frontal scans, as Gichoya et al. did. Whether this makes a meaningful difference is not clear, but since lateral scans are less common, it is possible that including them hinders learning.

For actual training, we use only 10,000 images due to limited computing power. Training was done on the standard T4 GPUs available in Google Colab. Again, this is a significant difference from Gichoya et al., who use more than 100,000 images.

[Figure: example frontal chest X-ray from the dataset]

Image Transforms

We also used the image transforms that Gichoya et al. implemented. These include a resize to 224 by 224, image normalization (to the ImageNet mean and standard deviation), random horizontal flips, and random rotations of at most 15 degrees. We did not, however, include a random zoom as in the paper. In general, these transforms are a good way of preventing overfitting. Some, like the random rotations, may also prevent the model from latching onto incidental features such as a patient's lean or posture.
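A sketch of this augmentation pipeline using torchvision follows; the grayscale-to-RGB replication step is our own assumption, added so that single-channel X-rays match the three-channel ImageNet statistics.

import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((224, 224)),
    T.Grayscale(num_output_channels=3),      # assumed: replicate to 3 channels
    T.RandomHorizontalFlip(),
    T.RandomRotation(15),                     # rotate by at most 15 degrees
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet mean
                std=[0.229, 0.224, 0.225]),   # ImageNet standard deviation
])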

Models

As for the models themselves, we used ResNet and EfficientNet. These are popular deep learning architectures for image classification, and both achieve accuracies of around 70-80% when evaluated on the ImageNet database. Specifically, we used pretrained EfficientNetB0 and ResNet18 models. We chose these variants because they have relatively few parameters (which means faster training and a smaller chance of overfitting). Moreover, these models are trained on the 224 by 224 image size that we used.
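The sketch below shows how such pretrained backbones can be adapted to our three classes (White, Black, Asian) with torchvision; the exact setup in our notebooks may differ slightly.

import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained weights and replace the final layer for 3 classes.
resnet = models.resnet18(pretrained=True)
resnet.fc = nn.Linear(resnet.fc.in_features, 3)

effnet = models.efficientnet_b0(pretrained=True)
effnet.classifier[1] = nn.Linear(effnet.classifier[1].in_features, 3)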

We also implemented a ResNet18 model on our own but achieved a lower accuracy. A known problem with very deep neural networks is that performance degrades as more layers are added, even though the extra layers could in principle learn an identity mapping. ResNet combats this issue with skip connections, which save the output of earlier layers and add it to the current output, avoiding this degradation.
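To illustrate the idea, here is a minimal residual block, not the exact block from our implementation:

import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal illustration of a ResNet-style skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection: add the input back in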

To gauge what the ResNet18 model is “looking” at, we extracted the filters of the first convolutional layer and visualized the feature maps at three convolutional layers: layer 0, layer 8, and the last layer, layer 16.
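One way to capture such intermediate feature maps is with forward hooks, as in the sketch below; model is assumed to be a trained ResNet18, image a single transformed X-ray tensor, and the layer indexing is illustrative.

import torch

feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

# Register hooks on the convolutional layers we want to visualize.
conv_layers = [m for m in model.modules() if isinstance(m, torch.nn.Conv2d)]
for idx in (0, 8, 16):
    conv_layers[idx].register_forward_hook(save_output(f"layer_{idx}"))

with torch.no_grad():
    model(image.unsqueeze(0))   # one forward pass fills feature_maps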

To optimize a model, we trained it on 10,000 images in a loop, trying different learning rates for the Adam optimizer and \(\gamma\) values for the exponential scheduler. We trained all parameters of the pretrained models; since there is no reason to expect the ImageNet database to contain X-ray-like images, we assumed it would be best to fine-tune all parameters rather than only the final layers.

We note that the training process for the self-implemented ResNet18 model was different: we trained first with an exponential learning rate scheduler for 30 epochs, then for 15 epochs with a reduce-on-plateau scheduler, as the loss did not decrease otherwise.

In the same loop, we then validated the model on 2,500 other images from the training set to find the optimal hyperparameters. Cross-entropy loss was used for all models.
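The sketch below shows one such training configuration in PyTorch; model, train_loader, and the specific hyperparameter values are illustrative rather than the exact settings of every run.

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.735)

num_epochs = 30
for epoch in range(num_epochs):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()   # decay the learning rate once per epoch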

We define accuracy as the proportion of correct guesses, and we analyze the confusion matrices of our model.

As mentioned before, there may be a gender bias in our model because there are more male than female patients in our training dataset. We inspected this by splitting our test set into male and female counterparts and testing the model on each subset. Gender bias is then examined by looking at the score and confusion matrix for each gendered subset.
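A sketch of this per-gender evaluation follows; test_loader_male is an assumed DataLoader built from the male test subset, and the same procedure is repeated for the female subset.

import numpy as np
import torch
from sklearn.metrics import confusion_matrix

device = "cuda" if torch.cuda.is_available() else "cpu"
model.eval()
preds, truths = [], []
with torch.no_grad():
    for images, labels in test_loader_male:
        outputs = model(images.to(device))
        preds.append(outputs.argmax(dim=1).cpu().numpy())
        truths.append(labels.numpy())

preds, truths = np.concatenate(preds), np.concatenate(truths)
accuracy = (preds == truths).mean()
cm = confusion_matrix(truths, preds, normalize='true')   # rows are true classes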

5. Results

Loss and Accuracy History

Our best models achieved an accuracy of about 70-75% on the equal testing set. Pretrained ResNet and EfficientNet models obtained similar accuracies and losses, so we will display only EfficientNet results.

After optimizing the pretrained EfficientNetB0 algorithm with an initial learning rate at 0.001 and an exponential scheduler with \(\gamma = 0.735\), we achieved an accuracy of 74% on the balanced test set.

As we can see in the following figures, the training score and loss gradually improved, while the validation score and loss plateaued after a few epochs. This is likely a sign of overfitting. We tried to address this by altering the scheduler type and varying the Adam learning rate from 0.001 to 0.01, but the overfitting persisted.

[Figures: EfficientNetB0 training and validation loss and score curves]

We also did our own implementation of ResNet18, and obtained comparable results. The learning rate was also set to 0.001, and the exponential scheduler at \(\gamma = 0.735\). The issue of overfitting remained, and our model achieved a score of 68% when tested on unseen data.

Confusion Matrix

Shown below are the confusion matrices for the EfficientNetB0 model with \(\gamma = 0.735\) over a subset of 2500 male and female patients for both frontal and lateral images. The horizontal axis represents the predicted classes (White, Black, Asian), and the vertical axis represents the true labels.

Confusion matrix for male patients Confusion matrix for female patients

From the confusion matrices above, we notice that for male patients the model classified the White class best, correctly identifying 78% of those patients, whereas for female patients the model classified the Asian class best, correctly identifying 79%. However, across all classes (White, Black, Asian), the true-positive rates are relatively similar and can be interpreted as roughly equivalent.

For male patients, 19% of the Black class and 18% of the Asian class were predicted as White. For female patients, 17% of the Black class and 15% of the Asian class were predicted as White. The rates are similar for White patients misclassified as Asian: 12% of the male White class and 16% of the female White class were predicted as Asian. In contrast, the model misclassified Asian and White patients as Black at significantly lower rates, and it rarely confused the Black and Asian classes with each other.

Thus, although we trained on an equal subset of each class (White, Black, Asian), the confusion matrices suggest a representational bias: the White class is more likely to be predicted than the Black and Asian classes. So, although the model classifies each race with moderate accuracy (up to 80%), this may come at the cost of misclassifying a patient's true race, with errors skewed toward the White class. This would have serious implications if an algorithm like this were commercialized, and it calls into question the validity of similar algorithms that report high accuracy rates for the true races. The causes are not clear from our current analysis, but perhaps the number of distinct patients represented among those images is not equal: White patients on average have more images, which could mean a smaller number of distinct patients during training.

For comparison, we also show the self-implemented ResNet18 confusion matrices. Notice that the correct predictions remain similar even for the unequal test set. Of course, the accuracy of 71% is not quite good enough to match the base rate of 78% (the accuracy an algorithm would achieve on the unequal set by always guessing White).

Self-implemented Resnet18, Equal Set Self-implemented Resnet18, Unequal Set

Interestingly, the correct predictions for the Black and Asian classes are significantly worse here than for EfficientNetB0. The false predictions, while more frequent than for EfficientNetB0, follow the same general trends discussed above.

Visualized Feature Map

Using our self-implemented ResNet18 model, we show below the first convolutional layer's filters and the feature maps of three convolutional layers: layer 0, layer 8, and layer 16 (the last layer).

The patient used for this demonstration is a White, non-Hispanic female. Shown below is the patient's frontal-view image used for this experiment.

Frontal-view image of patient

And here is the first convolutional layer’s filter:

First convolutional layer filter of the self-implemented ResNet-18 neural network model

Then, we passed the image through the convolutional layers of the model and visualized the resulting feature maps. For simplicity, we show only three of the layers. It is interesting to see the variation in what the model “looks” at, highlighted by the whiter patches of each map, which correspond to the parts of the image that are most strongly activated.

Feature maps from the first convolutional layer (layer 0) of self-implemented ResNet18 model
Feature maps from the ninth convolutional layer (layer 8) of self-implemented ResNet18 model
Feature maps from the last convolutional layer (layer 16) of self-implemented ResNet18 model

Looking at the three layers, we observe that the last layer is highly abstracted: the human eye can no longer make out the original frontal-view image. This last layer is especially important, as it reflects the learned “features” the model actually uses for classification. We also observe that the model focuses on different aspects of the image, as the filters used to create each feature map vary.

What exactly it is observing is unclear; perhaps we can say that it is able to distinguish between the bones and lungs in an X-ray.

6. Conclusion

The models that our project produced can classify race at around 70-75% accuracy based only on chest X-rays. Given that this is conducted on the balanced set, this seems to affirm that chest X-rays can be used to predict race.

We also investigated the ethical issues that these models could pose. We speculate that if the results of this project were used in bad faith, existing racial inequalities would be reinforced and worsened. While this is not a conclusion supported by the algorithm or any other literature, some may interpret this algorithm as evidence for racial essentialism. This would be a problematic conclusion given the ethical and practical issues of racial essentialism, and its byproduct, race-based medicine.

Right now, our model is still in its infancy, and we do not know for sure what it is looking at to make its decisions. If we had more time, the first thing we would do is test our algorithm on the MIMIC-CXR dataset. Training and validating on multiple chest X-ray datasets would show more convincingly that medical imaging AI can detect race. With access to better resources, we could have tried to train on a larger subset of the data. Looking into regularization methods could also have helped with overfitting, as could reducing the number of layers (and thus parameters) of the model, or training only the final linear layers of the complex models.

References

Adleberg, Jason, Amr Wardeh, Florence X Doo, Brett Marinelli, Tessa S Cook, David S Mendelson, and Alexander Kagen. 2022. “Predicting Patient Demographics from Chest Radiographs with Deep Learning.” Journal of the American College of Radiology 19 (10): 1151–61.
Apeles, Linda. 2022. “Race-Based Prescribing for Black People with High Blood Pressure Shows No Benefit.” Patient Care.
Cerdeña, Jessica P, Marie V Plaisime, and Jennifer Tsai. 2020. “From Race-Based to Race-Conscious Medicine: How Anti-Racist Uprisings Call Us to Act.” The Lancet 396 (10257): 1125–28.
Cheng, Ching-Yu, WH Linda Kao, Nick Patterson, Arti Tandon, Christopher A Haiman, Tamara B Harris, Chao Xing, et al. 2009. “Admixture Mapping of 15,280 African Americans Identifies Obesity Susceptibility Loci on Chromosomes 5 and x.” PLoS Genetics 5 (5): e1000490.
Freedman, Matthew L, Christopher A Haiman, Nick Patterson, Gavin J McDonald, Arti Tandon, Alicja Waliszewska, Kathryn Penney, et al. 2006. “Admixture Mapping Identifies 8q24 as a Prostate Cancer Risk Locus in African-American Men.” Proceedings of the National Academy of Sciences 103 (38): 14068–73.
Garbin, Christian, Pranav Rajpurkar, Jeremy Irvin, Matthew P Lungren, and Oge Marques. 2021. “Structured Dataset Documentation: A Datasheet for CheXpert.” arXiv Preprint arXiv:2105.03020.
Gichoya, Judy Wawira, Imon Banerjee, Ananth Reddy Bhimireddy, John L Burns, Leo Anthony Celi, Li-Ching Chen, Ramon Correa, et al. 2022. “AI Recognition of Patient Race in Medical Imaging: A Modelling Study.” The Lancet Digital Health 4 (6): e406–14.
Irvin, Jeremy, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, et al. 2019. “Chexpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.” In Proceedings of the AAAI Conference on Artificial Intelligence, 33:590–97. 01.
Maglo, Koffi N, Tesfaye B Mersha, and Lisa J Martin. 2016. “Population Genomics and the Statistical Values of Race: An Interdisciplinary Perspective on the Biological Classification of Human Populations and Implications for Clinical Genetic Epidemiological Research.” Frontiers in Genetics 7: 22.
Reich, David, Nick Patterson, Philip L De Jager, Gavin J McDonald, Alicja Waliszewska, Arti Tandon, Robin R Lincoln, et al. 2005. “A Whole-Genome Admixture Scan Finds a Candidate Locus for Multiple Sclerosis Susceptibility.” Nature Genetics 37 (10): 1113–18.
Seyyed-Kalantari, Laleh, Haoran Zhang, Matthew BA McDermott, Irene Y Chen, and Marzyeh Ghassemi. 2021. “Underdiagnosis Bias of Artificial Intelligence Algorithms Applied to Chest Radiographs in Under-Served Patient Populations.” Nature Medicine 27 (12): 2176–82.
Tang, Hua, Tom Quertermous, Beatriz Rodriguez, Sharon LR Kardia, Xiaofeng Zhu, Andrew Brown, James S Pankow, et al. 2005. “Genetic Structure, Self-Identified Race/Ethnicity, and Confounding in Case-Control Association Studies.” The American Journal of Human Genetics 76 (2): 268–75.
Vergara, Candelaria, Luis Caraballo, Dilia Mercado, Silvia Jimenez, Winston Rojas, Nicholas Rafaels, Tracey Hand, et al. 2009. “African Ancestry Is Associated with Risk of Asthma and High Total Serum IgE in a Population from the Caribbean Coast of Colombia.” Human Genetics 125: 565–79.
Vyas, Darshali A, Leo G Eisenstein, and David S Jones. 2020. “Hidden in Plain Sight—Reconsidering the Use of Race Correction in Clinical Algorithms.” New England Journal of Medicine. Mass Medical Soc.
Yi, Paul H, Jinchi Wei, Tae Kyung Kim, Jiwon Shin, Haris I Sair, Ferdinand K Hui, Gregory D Hager, and Cheng Ting Lin. 2021. “Radiology ‘Forensics’: Determination of Age and Sex from Chest Radiographs Using Deep Learning.” Emergency Radiology 28: 949–54.