Responsible AI · April 5, 2026 · 5 min read · By Olatunde Adedeji

Designing the Data Model Behind Fairness Monitoring

A short walkthrough of the schema and layered architecture behind production fairness monitoring.

Introduction

When people talk about AI fairness, they often jump straight to metrics, policy, or evaluation reports.

I think the more practical question comes earlier:

How should the data be structured so unfairness stays visible in production?

That was the real design challenge here. If the schema cannot preserve subgroup behavior clearly, the dashboard may look polished while hiding the exact disparities teams need to investigate.

Starting with the architecture

The system became much easier to design once I split it into four layers:

  • Monitoring for overall model health
  • Fairness for subgroup performance
  • Alerting for threshold breaches
  • Incident workflow for response and resolution

Each layer answers a different question.

Monitoring asks whether the model is healthy overall.
Fairness asks who may be experiencing the model differently.
Alerting asks what now deserves attention.
Incident workflow asks what happened after the issue was seen.

That separation matters because fairness should not be a side calculation added at the end. It needs its own place in the architecture.

Keeping the deployed model at the center

The schema starts with the deployed AI model as the anchor entity. Every important operational record connects back to it: monitoring history, fairness history, and alert history.

python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class AIModel(Base):
    __tablename__ = "ai_models"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String(255), nullable=False)

    # Every operational record points back to this anchor entity
    metrics = relationship("MonitoringMetric", back_populates="model")
    fairness_metrics = relationship("FairnessMetric", back_populates="model")
    alerts = relationship("Alert", back_populates="model")

I like this pattern because it keeps the system easy to navigate. If you want to inspect drift, subgroup disparity, or prior incidents, you start from the same model record.
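
As a minimal sketch, assuming an open SQLAlchemy session and the models in this post, that navigation might look like the following (operational_snapshot is a helper name I am inventing for illustration):

python
from sqlalchemy.orm import Session

def operational_snapshot(session: Session, model_id: int) -> dict:
    """Hypothetical helper: walk one model record out to its history."""
    model = session.get(AIModel, model_id)
    return {
        "model": model.name,
        "health_points": len(model.metrics),             # overall monitoring rows
        "fairness_points": len(model.fairness_metrics),  # subgroup rows
        "active_alerts": [a for a in model.alerts if a.status == "active"],
    }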

Separating model health from subgroup behavior

One design choice mattered more than most others: do not collapse monitoring and fairness into one generic metrics table.

The usual monitoring table stores overall health signals such as accuracy, drift, latency, missing data, and prediction volume. Those metrics describe the model as a whole.
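
FairWatch's exact column names may differ, but a sketch of that whole-model table looks roughly like this:

python
from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer
from sqlalchemy.orm import relationship

# Hedged sketch of the monitoring table described above; the
# actual column names in FairWatch may differ.
class MonitoringMetric(Base):
    __tablename__ = "monitoring_metrics"

    id = Column(Integer, primary_key=True, index=True)
    model_id = Column(Integer, ForeignKey("ai_models.id"), nullable=False)
    timestamp = Column(DateTime, nullable=False)

    # Whole-model health signals: no subgroup context here
    accuracy = Column(Float)
    drift_score = Column(Float)
    latency_ms = Column(Float)
    missing_data_rate = Column(Float)
    prediction_count = Column(Integer)

    model = relationship("AIModel", back_populates="metrics")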

Fairness needed a different structure.

python
from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer, String
from sqlalchemy.orm import relationship

class FairnessMetric(Base):
    __tablename__ = "fairness_metrics"

    id = Column(Integer, primary_key=True, index=True)
    model_id = Column(Integer, ForeignKey("ai_models.id"), nullable=False)
    timestamp = Column(DateTime, nullable=False)

    # Which subgroup dimension this row describes, and which value within it
    segment_name = Column(String(100), nullable=False)
    segment_value = Column(String(100), nullable=False)

    # Performance recorded per subgroup, not averaged away
    accuracy = Column(Float)
    error_rate = Column(Float)
    false_positive_rate = Column(Float)
    false_negative_rate = Column(Float)
    prediction_count = Column(Integer)

    model = relationship("AIModel", back_populates="fairness_metrics")

This design keeps subgroup context intact.

With segment_name and segment_value, the same table can represent:

  • age_group → 65+
  • gender → female
  • insurance_type → uninsured

That flexibility is important because fairness is always contextual. Different healthcare use cases may care about different subgroup dimensions. The schema needs to support that without becoming messy.
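
To make that concrete, here is how rows for different dimensions could land in the same table. The numbers are invented, and session is assumed to be an open SQLAlchemy session:

python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
session.add_all([
    # Invented example values, mirroring the segment examples above
    FairnessMetric(model_id=1, timestamp=now,
                   segment_name="age_group", segment_value="65+",
                   accuracy=0.81, false_negative_rate=0.22, prediction_count=340),
    FairnessMetric(model_id=1, timestamp=now,
                   segment_name="insurance_type", segment_value="uninsured",
                   accuracy=0.77, false_negative_rate=0.28, prediction_count=120),
])
session.commit()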

Avoiding the one-score trap

A single fairness score sounds convenient, but it usually hides the information teams actually need.

If a model underperforms for one group, operators need to know:

  • which group is affected
  • whether the issue is getting worse
  • whether the disparity is driven by false positives or false negatives
  • how many predictions are involved

A single number rarely answers those questions well. Storing fairness at the subgroup level does.

That is the real value of the schema: it preserves the detail needed for later investigation.
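
As an illustration, a per-subgroup query in SQLAlchemy 2.0 style can answer those questions directly (the filter values here are invented):

python
from sqlalchemy import func, select

# Illustrative query: average false negative rate per subgroup value,
# with prediction volume for context.
stmt = (
    select(
        FairnessMetric.segment_value,
        func.avg(FairnessMetric.false_negative_rate).label("avg_fnr"),
        func.sum(FairnessMetric.prediction_count).label("n"),
    )
    .where(FairnessMetric.model_id == 1,
           FairnessMetric.segment_name == "age_group")
    .group_by(FairnessMetric.segment_value)
)
for segment, avg_fnr, n in session.execute(stmt):
    print(f"{segment}: FNR={avg_fnr:.3f} across {n} predictions")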

Treating alerts as records, not pop-ups

Alerts also needed their own model.

It would have been easy to treat alerts as temporary notifications. I think that would weaken the system. Fairness issues are not just UI events. They are operational events, so they need history.

python
from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer, String, Text
from sqlalchemy.orm import relationship

class Alert(Base):
    __tablename__ = "alerts"

    id = Column(Integer, primary_key=True, index=True)
    model_id = Column(Integer, ForeignKey("ai_models.id"), nullable=False)

    # What was breached
    metric_name = Column(String(100), nullable=False)
    threshold = Column(Float, nullable=False)
    actual_value = Column(Float, nullable=False)
    severity = Column(String(20), nullable=False)

    # How the team responded
    status = Column(String(20), default="active")
    acknowledged_by = Column(String(255))
    acknowledged_at = Column(DateTime)
    notes = Column(Text)

    model = relationship("AIModel", back_populates="alerts")

This turns an alert into more than a warning badge. It becomes a record of what crossed a threshold, who picked it up, and how the team responded.

That matters because fairness monitoring without response history is incomplete.
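
Here is a sketch of the acknowledgement step; the "acknowledged" status value and the helper name are my own illustration rather than FairWatch's exact workflow:

python
from datetime import datetime, timezone
from sqlalchemy.orm import Session

def acknowledge_alert(session: Session, alert_id: int,
                      operator: str, notes: str = "") -> None:
    """Record who picked up the alert and when, on the alert row itself."""
    alert = session.get(Alert, alert_id)
    alert.status = "acknowledged"  # assumed status value, for illustration
    alert.acknowledged_by = operator
    alert.acknowledged_at = datetime.now(timezone.utc)
    alert.notes = notes
    session.commit()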

Adding a clean API layer

The database models are only one part of the design. The API layer shapes how that data is exposed.

I like using Pydantic here because it separates storage from presentation. The database stays normalized, while the API can return cleaner response objects for dashboards and summaries.

That is especially useful for computed views, where the frontend needs one clear response rather than several low-level tables stitched together.
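
A minimal sketch in Pydantic v2 style, with a schema name and field selection I am choosing to mirror the FairnessMetric columns:

python
from pydantic import BaseModel, ConfigDict

class FairnessSummary(BaseModel):
    # Lets the schema be built directly from SQLAlchemy rows
    model_config = ConfigDict(from_attributes=True)

    segment_name: str
    segment_value: str
    accuracy: float | None = None
    false_negative_rate: float | None = None
    prediction_count: int | None = None

FastAPI can then return FairnessSummary.model_validate(row) for each row, so the normalized tables never leak into the response shape.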

Keeping the setup practical

The database setup uses SQLite for local development and PostgreSQL for production, with SQLAlchemy abstracting the difference.

That is a simple choice, but a useful one. It keeps local iteration light while still supporting a more realistic deployment path later. For systems like this, practicality matters. If the platform is hard to run or hard to evolve, it becomes harder to improve fairness operations over time.
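
A sketch of that setup, assuming a DATABASE_URL environment variable and a local file name of my own choosing:

python
import os
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Default to a local SQLite file (name assumed); production sets
# DATABASE_URL to a PostgreSQL connection string.
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./fairwatch.db")

engine = create_engine(
    DATABASE_URL,
    # SQLite needs this flag when used across threads (e.g. under FastAPI)
    connect_args={"check_same_thread": False}
    if DATABASE_URL.startswith("sqlite") else {},
)
SessionLocal = sessionmaker(bind=engine, autoflush=False)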

Final Thoughts

What I like most about this architecture is that it is clear without being rigid.

Monitoring and fairness stay separate because they answer different questions. Alerts bridge those concerns when something crosses a threshold. Incident workflow gives the system memory. And the model-centered relationships keep everything tied back to the deployed AI system.

That is why I see the schema as the real backbone of fairness monitoring.

The dashboard may be what people notice first. But the data model is what determines whether the system can actually preserve subgroup context, surface meaningful disparities, and support useful response when something changes.


The full source code for FairWatch is available on GitHub: FairWatch

Responsible AI · AI Fairness · Healthcare AI · FastAPI · SQLAlchemy