The business problem
US colleges and universities lose $16.5 billion annually in revenue from student dropout. The average 6-year graduation rate for 4-year institutions is only 62%. For every student who drops out, the institution loses tuition revenue and the student accumulates debt without a credential. Early identification of at-risk students (before midterms, ideally within the first 3 weeks) gives advisors time to intervene effectively.
Why flat ML fails
- No course context: A C+ in Organic Chemistry is very different from a C+ in Introduction to Art. The course graph captures difficulty, prerequisites, and instructor effectiveness.
- No peer effects: Students in strong study groups and cohorts outperform isolated students with identical grades. The social/academic graph captures these peer effects.
- No pathway awareness: A student who skipped a prerequisite and is now struggling in the advanced course needs different support than one who completed the full prerequisite chain. The course graph encodes these pathways.
- No instructor signal: Some instructors have consistently better student outcomes. The student-course-instructor graph captures this quality signal and its interaction with student backgrounds.
The relational schema
Node types:
Student (id, gpa, credits_earned, financial_aid, first_gen)
Course (id, department, level, avg_grade, dfw_rate)
Instructor (id, department, tenure, avg_eval, years_exp)
Program (id, department, degree_type, avg_completion_rate)
Edge types:
Student --[enrolled_in]--> Course (grade, semester)
Course --[taught_by]--> Instructor (semester, section)
Course --[prerequisite]--> Course
Student --[in_program]--> Program
Student --[study_with]--> Student (shared_courses)

Students, courses, instructors, and programs form the enrollment graph. Study-with edges capture peer cohort effects.
PyG architecture: HeteroConv for enrollment data
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, HeteroConv, Linear

class StudentOutcomeGNN(torch.nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        # Lazy Linear(-1, ...) layers project each node type's raw
        # features (different widths per type) into a shared hidden size.
        self.student_lin = Linear(-1, hidden_dim)
        self.course_lin = Linear(-1, hidden_dim)
        self.instructor_lin = Linear(-1, hidden_dim)
        self.program_lin = Linear(-1, hidden_dim)
        # Messages flow along the edge directions listed here; to let
        # course and instructor signal reach the student embeddings, add
        # reverse edges in preprocessing (e.g. with
        # torch_geometric.transforms.ToUndirected).
        self.conv1 = HeteroConv({
            ('student', 'enrolled_in', 'course'): SAGEConv(hidden_dim, hidden_dim),
            ('course', 'taught_by', 'instructor'): SAGEConv(hidden_dim, hidden_dim),
            ('course', 'prerequisite', 'course'): SAGEConv(hidden_dim, hidden_dim),
            ('student', 'in_program', 'program'): SAGEConv(hidden_dim, hidden_dim),
            ('student', 'study_with', 'student'): SAGEConv(hidden_dim, hidden_dim),
        }, aggr='mean')
        self.conv2 = HeteroConv({
            ('student', 'enrolled_in', 'course'): SAGEConv(hidden_dim, hidden_dim),
            ('student', 'study_with', 'student'): SAGEConv(hidden_dim, hidden_dim),
            ('student', 'in_program', 'program'): SAGEConv(hidden_dim, hidden_dim),
        }, aggr='mean')
        self.classifier = Linear(hidden_dim, 1)

    def forward(self, x_dict, edge_index_dict):
        # Per-type input projections.
        x_dict['student'] = self.student_lin(x_dict['student'])
        x_dict['course'] = self.course_lin(x_dict['course'])
        x_dict['instructor'] = self.instructor_lin(x_dict['instructor'])
        x_dict['program'] = self.program_lin(x_dict['program'])
        # Two rounds of heterogeneous message passing.
        x_dict = {k: F.relu(v) for k, v in
                  self.conv1(x_dict, edge_index_dict).items()}
        x_dict = self.conv2(x_dict, edge_index_dict)
        # One persistence probability per student.
        return torch.sigmoid(self.classifier(x_dict['student']).squeeze(-1))

HeteroConv aggregates course difficulty, instructor quality, peer performance, and program context. Two hops capture prerequisite-chain effects.
Expected performance
- GPA-based heuristic: ~58 AUROC
- LightGBM (flat features): 62.44 AUROC
- GNN (enrollment graph): 75.83 AUROC
- KumoRFM (zero-shot): 76.71 AUROC
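AUROC figures like those above can be checked from raw scores in a few lines. A minimal rank-based sketch (equivalent to the Mann-Whitney statistic; for simplicity it ignores tied scores):

```python
import torch

def auroc(scores: torch.Tensor, labels: torch.Tensor) -> float:
    """AUROC as the probability that a random positive example outranks
    a random negative one; assumes no tied scores."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score against every negative score.
    wins = (pos.unsqueeze(1) > neg.unsqueeze(0)).float().sum()
    return (wins / (len(pos) * len(neg))).item()

scores = torch.tensor([0.9, 0.8, 0.3, 0.2])
labels = torch.tensor([1, 0, 1, 0])
print(auroc(scores, labels))  # 0.75
```

The pairwise comparison is O(n_pos * n_neg); for large cohorts, a sorted-rank implementation (or sklearn's roc_auc_score) is the usual choice.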
Or use KumoRFM in one line
PREDICT will_persist FOR student
USING student, enrollment, course, instructor, program

One PQL query. KumoRFM constructs the enrollment graph from your SIS data and predicts student persistence.