Hierarchical Modeling of ICD Codes in EHR Foundation Models

Preprint · Under Review

Figure: Structure of an ICD-10-CM diagnosis code. ICD-10-CM diagnosis codes carry nested clinical structure: chapter → block → 3-character category → specific diagnosis. HICD models this hierarchy as an inductive bias.

Abstract

Electronic health record foundation models typically treat ICD diagnosis codes as flat tokens, overlooking the clinically meaningful hierarchical structure that captures disease families, subcategories, and fine-grained diagnostic detail. In this work, we study ICD-10-CM hierarchy as a general inductive bias for clinical representation learning. We investigate two complementary mechanisms for incorporating hierarchy: first, by augmenting diagnosis sequences in a BERT-style transformer with tokens corresponding to different levels of the ICD hierarchy, and second, by injecting hierarchy into graph-based code representations through hierarchy-aware edges combined with diagnosis co-occurrence structure. We conduct experiments on two large-scale real-world clinical datasets: MIMIC-IV, used for pretraining and in-domain evaluation, and eICU, used to assess cross-dataset transfer via frozen-encoder probing. Our findings show that explicitly encoding ICD hierarchy improves over flat code representations in both in-domain and cross-dataset settings, while revealing that the most useful level of hierarchy depends on both the task and the modeling approach.

Contributions

Method

ICD-10 hierarchy levels (G0, G1, G2)

Each ICD-10-CM diagnosis code belongs to a chapter (G0), an officially defined block of related categories (G1), and a three-character category (G2). We treat (G0, G1, G2) ∈ {0, 1}³ as eight ablation settings, where (0, 0, 0) is the no-hierarchy baseline.
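
As a minimal sketch of the three levels and the eight ablation settings, the snippet below maps a code to its chapter, block, and three-character category. The `CHAPTERS` and `BLOCKS` tables are a tiny illustrative excerpt, and the function names are hypothetical; a real pipeline would load the full official ICD-10-CM tables.

```python
from itertools import product

# Illustrative excerpt only; the official ICD-10-CM tables define every chapter and block.
CHAPTERS = {  # (first category, last category) -> chapter title
    ("E00", "E89"): "Endocrine, nutritional and metabolic diseases",
    ("I00", "I99"): "Diseases of the circulatory system",
}
BLOCKS = {  # (first category, last category) -> block title
    ("E08", "E13"): "Diabetes mellitus",
    ("I10", "I16"): "Hypertensive diseases",
}


def lookup(ranges, category):
    """Return the 'start-end' range that contains a 3-character category, or None."""
    for start, end in ranges:
        if start <= category <= end:
            return f"{start}-{end}"
    return None


def hierarchy_levels(code):
    """Map an ICD-10-CM code (with or without the dot) to its G0, G1, G2 ancestors."""
    category = code.replace(".", "")[:3].upper()
    return {
        "G0": lookup(CHAPTERS, category),  # chapter
        "G1": lookup(BLOCKS, category),    # official block
        "G2": category,                    # three-character category
    }


# The eight ablation settings: every on/off combination of (G0, G1, G2).
CONFIGS = list(product((0, 1), repeat=3))

print(hierarchy_levels("E11.9"))
print(CONFIGS[0], len(CONFIGS))
```

Here `CONFIGS[0]` is the no-hierarchy baseline (0, 0, 0), and the remaining seven tuples enable one, two, or all three levels.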

Figure (a), HICD-BERT architecture: additive hierarchy embeddings on a BEHRT-style encoder.
Figure (b), HICD-Graph architecture: GCN over a diagnosis co-occurrence graph augmented with ICD-10 ontology edges, feeding a patient-level transformer.

HICD-BERT — token-level injection

Extends BEHRT. Each diagnosis token's input embedding is the sum of code, age, position, and segment embeddings plus up to three additive hierarchy embeddings, one per enabled level (1-/2-/3-character string prefix). The hierarchy-aware embedding layer is the only change to the BEHRT backbone, and the model is pretrained with masked language modeling and fine-tuned per task.
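
The additive embedding described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the lazily created random vectors stand in for learned embedding tables, and `DIM`, `embed`, and `token_embedding` are hypothetical names.

```python
import random

DIM = 8  # embedding width, chosen only for illustration
PREFIX_LEN = {"G0": 1, "G1": 2, "G2": 3}  # 1-/2-/3-character string prefixes
_rng = random.Random(0)
_tables = {}  # (table name, key) -> vector, created on first lookup


def embed(table, key):
    """Look up (or lazily create) a vector; a stand-in for a learned embedding table."""
    if (table, key) not in _tables:
        _tables[(table, key)] = [_rng.gauss(0.0, 0.02) for _ in range(DIM)]
    return _tables[(table, key)]


def token_embedding(code, age, position, segment, enabled):
    """Sum code, age, position, and segment embeddings plus one hierarchy
    embedding for each enabled level in (G0, G1, G2)."""
    parts = [
        embed("code", code),
        embed("age", age),
        embed("position", position),
        embed("segment", segment),
    ]
    for level, on in zip(("G0", "G1", "G2"), enabled):
        if on:
            parts.append(embed(level, code[: PREFIX_LEN[level]]))
    # Element-wise sum across all contributing vectors.
    return [sum(values) for values in zip(*parts)]


flat = token_embedding("I509", age=67, position=3, segment=0, enabled=(0, 0, 0))
hier = token_embedding("I509", age=67, position=3, segment=0, enabled=(1, 0, 1))
```

Because the hierarchy vectors are keyed by prefix, two codes such as `I509` and `I5020` share the same G2 vector (`I50`), which is how related diagnoses receive a common component in their input embeddings.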

HICD-Graph — graph-level encoding

Constructs a diagnosis graph with PMI-weighted co-occurrence edges, augmented with uniform-weight hierarchy edges from each ICD code to its enabled ancestor nodes (ICD-10 chapter, official block, 2-character prefix). A two-layer GCN trained with a link-prediction objective learns hierarchy-aware code embeddings, which initialise a patient-level transformer (masked mean pooling per visit, attention pooling across visits).
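
A minimal sketch of the graph construction step follows. Several details are assumptions because the text above does not fix them: co-occurrence is counted per visit, only pairs with positive PMI are kept, the uniform hierarchy weight is 1.0, and the caller supplies the `ancestors` mapping for whichever levels are enabled. `build_graph` is a hypothetical name.

```python
import math
from collections import Counter
from itertools import combinations


def build_graph(visits, ancestors, hierarchy_weight=1.0):
    """Return {(u, v): weight} with PMI co-occurrence edges plus hierarchy edges.

    visits    -- list of visits, each a list of diagnosis codes
    ancestors -- dict mapping a code to the ancestor node ids enabled for it
    """
    n_visits = len(visits)
    code_count, pair_count = Counter(), Counter()
    for visit in visits:
        codes = sorted(set(visit))  # count each code once per visit
        code_count.update(codes)
        pair_count.update(combinations(codes, 2))

    edges = {}
    for (a, b), n_ab in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ) with visit-level probabilities.
        pmi = math.log(n_ab * n_visits / (code_count[a] * code_count[b]))
        if pmi > 0:  # assumption: keep positively associated pairs only
            edges[(a, b)] = pmi

    # Uniform-weight edges from each code to its enabled ancestor nodes.
    for code in code_count:
        for node in ancestors.get(code, []):
            edges[(code, node)] = hierarchy_weight
    return edges


visits = [["I10", "E119"], ["I10", "E119"], ["I10", "J189"], ["J189"]]
ancestors = {
    "I10": ["chapter:I00-I99"],
    "E119": ["chapter:E00-E89"],
    "J189": ["chapter:J00-J99"],
}
edges = build_graph(visits, ancestors)
```

The resulting weighted edge list is the kind of input a two-layer GCN would consume; the GCN and its link-prediction objective are outside the scope of this sketch.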

Tasks & datasets

Results

In-domain — 30-day readmission

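| Group | Hierarchy | HICD-BERT F1-macro | HICD-BERT AUROC | HICD-Graph F1-macro | HICD-Graph AUROC |
|---|---|---|---|---|---|
| Baseline | No hierarchy | 0.5786 | 0.6808 | 0.5976 | 0.6992 |
| Single-level | G0 | 0.5825 | 0.6858* | 0.6025* | 0.7086* |
| Single-level | G1 | 0.5740* | 0.6811 | 0.6006* | 0.7079* |
| Single-level | G2 | 0.5896* | 0.6921* | 0.6005* | 0.7086* |
| Two-level | G0+G1 | 0.5696* | 0.6863* | 0.6088* | 0.7105* |
| Two-level | G0+G2 | 0.5953* | 0.6918* | 0.5999 | 0.7097* |
| Two-level | G1+G2 | 0.5834* | 0.6917* | 0.6000 | 0.7113* |
| All-levels | G0+G1+G2 | 0.5731* | 0.6923* | 0.5991 | 0.7114* |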

* p < 0.05 from pairwise DeLong tests on AUROC and paired bootstrap tests on F1-macro, both against the no-hierarchy baseline. Full tables for emergency-admission prediction and cross-dataset transfer to eICU appear in the paper.
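
For readers unfamiliar with paired bootstrap testing on F1-macro, the sketch below shows one common formulation. The number of resamples, the two-sided p-value definition, and the function names are assumptions; the exact procedure used for the table is not specified here.

```python
import random


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores over the binary labels {0, 1}."""
    scores = []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap(y_true, pred_model, pred_base, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for the F1-macro difference (model minus baseline)."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        # Draw one resample of patients and score both models on it (paired).
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        diffs.append(
            f1_macro(yt, [pred_model[i] for i in idx])
            - f1_macro(yt, [pred_base[i] for i in idx])
        )
    frac_le = sum(d <= 0 for d in diffs) / n_boot
    frac_ge = sum(d >= 0 for d in diffs) / n_boot
    return min(1.0, 2 * min(frac_le, frac_ge))
```

Scoring both models on the same resampled patients is what makes the test paired: it removes variance that comes from which patients happen to be drawn, so small but consistent differences can still be detected.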

Embedding analysis

Figure (a), pairwise cosine similarity by ICD relationship type: pairwise similarity between leaf codes sharing the same G2 prefix and same chapter.
Figure (b), within-group similarity across hierarchy configurations: within-group similarity across all eight hierarchy configurations; G1+G2 produces the tightest clusters.

All hierarchy-aware variants pull related ICD codes closer in embedding space. The configuration with the tightest local clusters (G1+G2) is also among the strongest in downstream AUROC (0.7113 for HICD-Graph on 30-day readmission, against 0.6992 for the no-hierarchy baseline), which suggests a link between representation geometry and predictive accuracy.
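
One way to quantify "within-group similarity" is the mean pairwise cosine similarity over codes that share a prefix. The sketch below assumes groups are defined by the first three characters of the code and that embeddings arrive as a plain dict; both choices, and the function names, are illustrative rather than the paper's exact protocol.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def within_group_similarity(embeddings, prefix_len=3):
    """Mean pairwise cosine similarity over code pairs that share a prefix."""
    groups = defaultdict(list)
    for code, vector in embeddings.items():
        groups[code[:prefix_len]].append(vector)
    sims = [
        cosine(u, v)
        for vectors in groups.values()
        for u, v in combinations(vectors, 2)
    ]
    return sum(sims) / len(sims) if sims else float("nan")


toy = {"I501": [1.0, 0.0], "I509": [2.0, 0.0], "E119": [0.0, 1.0]}
score = within_group_similarity(toy)
```

Computing this score for the code embeddings of each of the eight configurations gives a single number per configuration that can be compared against its downstream AUROC.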

BibTeX

@article{thukral2026hicd,
  title   = {Hierarchical Modeling of ICD Codes in EHR Foundation Models},
  author  = {Thukral, Megha and Kang, Dong Gyun and Singh, Rudra Pratap
             and Hiremath, Shruthi Kashinath and H{\"a}nsel, Katrin and Pl{\"o}tz, Thomas},
  journal = {Preprint, Under Review},
  year    = {2026}
}