
Student Outcome: GNN on Enrollment Graphs

US colleges lose $16.5B annually to student dropout. Early warning systems flag at-risk students too late. Here is how to build a GNN that predicts outcomes using the full enrollment graph: courses, instructors, peers, and academic pathways.


TL;DR

  • Student outcome prediction is a node classification problem on the enrollment graph. Students are embedded in a network of courses, instructors, peers, and programs. Academic and social context both drive outcomes.
  • HeteroConv on the enrollment graph captures course difficulty, instructor effectiveness, peer group quality, and prerequisite chain completion, all as graph signals.
  • On RelBench benchmarks, GNNs achieve 75.83 AUROC vs 62.44 for flat-table LightGBM. Academic context and peer effects provide 13+ points of lift.
  • Early-semester signals (first grades, attendance, LMS patterns) combined with graph context enable intervention within the first 3 weeks.
  • KumoRFM predicts student outcomes with one PQL query (76.71 AUROC zero-shot), constructing the enrollment graph from your SIS data automatically.

The business problem

US colleges and universities lose $16.5 billion annually in revenue from student dropout. The average 6-year graduation rate for 4-year institutions is only 62%. For every student who drops out, the institution loses tuition revenue and the student accumulates debt without a credential. Early identification of at-risk students (before midterms, ideally within the first 3 weeks) gives advisors time to intervene effectively.

Why flat ML fails

  • No course context: A C+ in Organic Chemistry is very different from a C+ in Introduction to Art. The course graph captures difficulty, prerequisites, and instructor effectiveness.
  • No peer effects: Students in strong study groups and cohorts outperform isolated students with identical grades. The social/academic graph captures these peer effects.
  • No pathway awareness: A student who skipped a prerequisite and is now struggling in the advanced course needs different support than one who completed the full prerequisite chain. The course graph encodes these pathways.
  • No instructor signal: Some instructors have consistently better student outcomes. The student-course-instructor graph captures this quality signal and its interaction with student backgrounds.

The relational schema

schema.txt
Node types:
  Student    (id, gpa, credits_earned, financial_aid, first_gen)
  Course     (id, department, level, avg_grade, dfw_rate)
  Instructor (id, department, tenure, avg_eval, years_exp)
  Program    (id, department, degree_type, avg_completion_rate)

Edge types:
  Student    --[enrolled_in]-->  Course     (grade, semester)
  Course     --[taught_by]-->    Instructor (semester, section)
  Course     --[prerequisite]--> Course
  Student    --[in_program]-->   Program
  Student    --[study_with]-->   Student    (shared_courses)

Students, courses, instructors, and programs form the enrollment graph. Study-with edges capture peer cohort effects.

PyG architecture: HeteroConv for enrollment data

student_model.py
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, HeteroConv, Linear

class StudentOutcomeGNN(torch.nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.student_lin = Linear(-1, hidden_dim)
        self.course_lin = Linear(-1, hidden_dim)
        self.instructor_lin = Linear(-1, hidden_dim)
        self.program_lin = Linear(-1, hidden_dim)

        # First hop: propagate along the original edge directions so
        # course, instructor, and program nodes absorb context.
        self.conv1 = HeteroConv({
            ('student', 'enrolled_in', 'course'): SAGEConv(
                hidden_dim, hidden_dim),
            ('course', 'taught_by', 'instructor'): SAGEConv(
                hidden_dim, hidden_dim),
            ('course', 'prerequisite', 'course'): SAGEConv(
                hidden_dim, hidden_dim),
            ('student', 'in_program', 'program'): SAGEConv(
                hidden_dim, hidden_dim),
            ('student', 'study_with', 'student'): SAGEConv(
                hidden_dim, hidden_dim),
        }, aggr='mean')

        # Second hop: flow the enriched course/program embeddings back
        # into students. The rev_* edge types are assumed to exist,
        # e.g. added with torch_geometric.transforms.ToUndirected().
        self.conv2 = HeteroConv({
            ('course', 'rev_enrolled_in', 'student'): SAGEConv(
                hidden_dim, hidden_dim),
            ('student', 'study_with', 'student'): SAGEConv(
                hidden_dim, hidden_dim),
            ('program', 'rev_in_program', 'student'): SAGEConv(
                hidden_dim, hidden_dim),
        }, aggr='mean')

        self.classifier = Linear(hidden_dim, 1)

    def forward(self, x_dict, edge_index_dict):
        # Project each node type's raw features to a shared hidden size.
        x_dict['student'] = self.student_lin(x_dict['student'])
        x_dict['course'] = self.course_lin(x_dict['course'])
        x_dict['instructor'] = self.instructor_lin(
            x_dict['instructor'])
        x_dict['program'] = self.program_lin(x_dict['program'])

        x_dict = {k: F.relu(v) for k, v in
                  self.conv1(x_dict, edge_index_dict).items()}
        x_dict = self.conv2(x_dict, edge_index_dict)

        # Per-student persistence probability; pair with BCELoss
        # (or return raw logits and use BCEWithLogitsLoss).
        return torch.sigmoid(
            self.classifier(x_dict['student']).squeeze(-1))

HeteroConv aggregates course difficulty, instructor quality, peer performance, and program context. Two hops capture prerequisite chain effects.

Expected performance

  • GPA-based heuristic: ~58 AUROC
  • LightGBM (flat features): 62.44 AUROC
  • GNN (enrollment graph): 75.83 AUROC
  • KumoRFM (zero-shot): 76.71 AUROC
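AUROC, the metric behind the numbers above, has a simple interpretation: the probability that a randomly chosen at-risk student is scored above a randomly chosen persisting one. A minimal pure-Python version (in practice you would use sklearn's roc_auc_score):

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the fraction of
    positive/negative pairs ranked correctly (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```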

Or use KumoRFM in one line

KumoRFM PQL
PREDICT will_persist FOR student
USING student, enrollment, course, instructor, program

One PQL query. KumoRFM constructs the enrollment graph from your SIS data and predicts student persistence.

Frequently asked questions

Why use GNNs for student outcome prediction?

Student outcomes depend on context: which courses they take, who teaches them, which study groups they join, and how their peer cohort performs. A student in a high-performing study group with an experienced instructor has different outcome probabilities than one with identical grades but isolated circumstances. GNNs capture this relational context.

What graph structure represents student outcomes?

Students, courses, instructors, and programs form a heterogeneous graph. Edges connect students to courses they enrolled in, courses to instructors, students to study groups or peer cohorts, and courses to prerequisite chains. The graph captures both academic pathways and social context.

How do you predict at-risk students early enough to intervene?

Use enrollment data and early-semester signals (first assignment grades, attendance, LMS login patterns) as temporal node features. The GNN identifies at-risk students by comparing their early trajectory to the graph neighborhood of similar students who previously succeeded or dropped out.
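As a toy sketch of that feature step, here is one way to collapse week 1-3 signals into a per-student feature vector. The record fields (`first_grades`, `sessions_attended`, `lms_logins`) are illustrative, not a real SIS/LMS schema:

```python
# Hypothetical per-student activity records for the first 3 weeks.
records = [
    {'student': 's1', 'first_grades': [0.9, 0.85],
     'sessions_attended': 8, 'sessions_total': 9, 'lms_logins': 14},
    {'student': 's2', 'first_grades': [0.55],
     'sessions_attended': 4, 'sessions_total': 9, 'lms_logins': 2},
]

def early_features(rec):
    """Collapse early-semester signals into a student-node feature vector."""
    grades = rec['first_grades']
    return [
        sum(grades) / len(grades) if grades else 0.0,       # early grade average
        rec['sessions_attended'] / rec['sessions_total'],   # attendance rate
        rec['lms_logins'] / 21,                             # LMS logins per day
    ]

features = {r['student']: early_features(r) for r in records}
```

These vectors would be concatenated onto the static student features (GPA, credits, aid status) before the type-specific linear projection.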

Can GNNs handle the small dataset sizes typical in education?

Education datasets are often small (thousands of students per institution). The graph structure helps by enabling knowledge transfer: patterns from similar students, courses, and instructors provide additional signal. Pre-training on larger multi-institution datasets and fine-tuning on local data also helps.

How does KumoRFM handle student outcome prediction?

KumoRFM takes your student information system data (students, enrollments, courses, grades, instructors) and predicts outcomes with one PQL query. It constructs the enrollment graph and captures academic pathway patterns automatically.

Learn more about graph ML

PyTorch Geometric is the open-source foundation for graph neural networks. Explore more layers, concepts, and production patterns.