TLDR

(Too Long Didn’t Read)

How might we help ML developers grapple with the social impact of their models and optimize for fairness? I was the lead and solo UX researcher on a cross-functional team building a zero-to-one product for our master’s capstone at UC Berkeley. From generative through evaluative research, I immersed myself in machine learning, ML fairness, and a B2B product designed for technical end users to inform how our team could balance engaging users with building the intentional friction needed to adequately capture the complexity of fairness.

Egaleco won the 2023 Sarukkai Social Impact Award, given to the project across School of Information degree programs with the greatest potential to address important social issues and improve people’s lives.


Overview

For my master’s capstone project at the UC Berkeley School of Information, I partnered with an interdisciplinary team of ten classmates to build a new machine learning (ML) fairness toolkit named Egaleco. The team included data scientists, engineers, a product manager, two policy specialists, two UX designers, and myself, the UX researcher. The UX team produced two research-driven design prototypes: the “design for implementation”, a prototype to inform the MVP (minimum viable product) the technical team members built, and the “design for inspiration”, a prototype to further explore users’ needs with the same core product requirements but unconstrained by the semester’s engineering resources.

The objectives of the research were both to inform the scope and design of the product and to contribute to ongoing work in ML fairness assessments beyond our capstone. To meet these objectives, I led three rounds of research: foundational interviews and two rounds of iterative usability testing with design prototypes.


Timeline

January-April 2023


Role

Lead UX researcher


Research Questions

ML Fairness Expert Interviews

  1. Which factors of an ML model’s design, development, and context can inform an effective assessment of fairness?

    • How do participants define ML fairness?

    • What do participants view as the ideal process for applying ML fairness best practices in model development?

  2. What knowledge is required for ML practitioners to conduct an effective assessment of fairness in an ML model?

    • How have participants approached building expertise in ML fairness?

    • How do participants approach assessing the fairness of ML models in practice?

ML Practitioner Interviews & Usability Testing

  1. What motivates ML practitioners and their non-technical stakeholders to incorporate assessments of fairness into their ML model development processes?

  2. What barriers do ML practitioners and their non-technical stakeholders encounter in incorporating assessments of fairness into their ML model development processes?

  3. What design principles, educational approaches, and interactions can engage ML practitioners to effectively assess the fairness of ML models?

    • How do participants define ML fairness?

    • How do participants approach assessing the fairness of an ML model? Why?

    • What resources or tools have participants found helpful for understanding the sociotechnical context of a model, if any?

    • What resources or tools have participants found helpful for facilitating technical and sociotechnical ML learning, if any?


Process

Based on recent research documenting the shortcomings of available tools, we scoped our project from the start to designing a resource to better support ML practitioners (people developing ML models), especially in the health sector, in assessing the fairness of their models.

Research Needfinding: After an initial review of existing literature, I organized a research needfinding exercise to efficiently consolidate our team’s diverse knowledge and experiences in ML fairness. I set up a Figjam template with themes related to our project’s purpose, scope, and functionality and assigned colors by sub-team focus area (technical, policy, UX). All team members contributed open questions they had, myself included, and used stickers to reinforce others’ questions. I clustered those questions by similarity, noted themes, and then copied and reorganized them into a research plan which informed our research questions and design. 

Institutional Review Board (IRB) Approval: Since our project required human-subject research, I led our application for UC Berkeley’s Institutional Review Board approval. We applied for an exemption to more extensive review by committee because our research design posed low risk to our participants and our participants did not constitute a vulnerable group. However, the required training and process were a helpful review to build practices to minimize any potential risk to our participants.

Foundational Interviews: The objective of the foundational research was to formulate a definition of fairness and scope a solution to better operationalize that definition. We chose to conduct semi-structured interviews with ML fairness experts (people with extensive experience assessing fairness in ML) and ML practitioners (people with ML experience and limited or functional exposure to ML fairness) to learn how both groups think about and approach fairness—or don’t. Interviewing both groups enabled us to strategize how best to close the gap between them: enabling practitioners to understand and effectively apply at least some of the expertise fairness experts offer. In total, we conducted twelve one-hour interviews over Zoom, including conversations with four fairness experts and eight ML practitioners. The interview guides for fairness experts and ML practitioners varied slightly, but both included a discussion of their background and professional experience, ML fairness broadly and in their experience, effective educational resources for ML, and considering fairness in the context of health ML model scenarios we provided.

To analyze and synthesize the interview data, we conducted several rounds of affinity mapping. Affinity mapping facilitated clustering and re-clustering data to ultimately translate “raw” data into product recommendations. While listening to interview recordings and reviewing transcripts, we transferred detailed, anonymized notes and quotes to a Figjam board, following a structured template for each type of participant (i.e., fairness expert and ML practitioner). In all cases, at least two team members listened to each interview recording. From the notes, we developed two affinity maps, one for fairness experts and one for ML practitioners, with individual participants tracked by color code. We clustered notes into two descriptive levels—key topics and subtopics—and then conducted another round of review to draw out insights: interpretations of what each cluster of notes meant for our research. Finally, we organized insights into clusters by our original research questions and co-developed product recommendations in synthesis sessions with the technical and product project leads. Through duplication of notes at each step of the analysis, our affinity mapping board enabled tracing recommendations back to insights, insights back to topics, subtopics, and data notes, and data notes back to individual interviews.

Iterative Usability Testing: Based on the foundational research findings, the UX designers created two prototypes, a “prototype for implementation” and a “prototype for inspiration”, built on the same core product requirements. The “prototype for implementation” incorporated testing feedback quickly for engineering execution, while the “prototype for inspiration” aimed to explore the user needs reflected in the testing sessions beyond the semester’s engineering limits. In the first round of usability testing, we showed both prototypes to participants to help elicit design feedback through comparing and contrasting designs. The second round of usability testing focused on the “prototype for inspiration” with the objective of generating learnings for how best to design for the core challenges of this approach to ML fairness. While the sessions surfaced a range of feedback at every step of the fairness assessment process, the tests particularly aimed to elicit an in-depth understanding of the fairness metric selection process—the step of the assessment where users must integrate the goals of their model with the mathematical approaches to measuring fairness. Building on the research questions described above, the usability testing specifically homed in on usability as defined by the ISO in terms of effectiveness, efficiency, and satisfaction.

We primarily recruited novice ML practitioners with limited ML fairness exposure to resemble the target user base for Egaleco—ML practitioners who know enough about ML fairness to seek a resource to address it but need assistance to use fairness metrics effectively. Five ML practitioners participated in each round of usability testing, for a total of ten participants. Each test session lasted an hour and included an introductory section to more thoroughly understand the participant’s background. I introduced participants to a sample classification model and tasked them with evaluating its fairness using the design prototype(s), thinking aloud as they worked. In the first round of usability testing, participants tried both distinct design prototypes, but the designs did not include the full flow. In the second round, participants used only the revised “prototype for inspiration” but explored the complete flow.

To analyze and synthesize the usability test sessions, we recorded notes and quotes on a Figjam board while watching session recordings. We color-coded notes by participant and arranged them by key steps in the fairness assessment flow. For each step, we then summarized key themes and noted actionable learnings.


Findings & Impact

Foundational Interviews

I consolidated findings from the foundational research into four key topics: motivations, barriers, definitions of fairness, and process practicalities at each stage of conducting an ML fairness assessment. Subtopics within each informed considerations and product recommendations, some beyond the scope of our semester’s work. Most critically, the foundational research findings highlighted priorities and painted a pre-mortem for our work—opportunities for failure or unintended harms. While existing research and toolkits provided clear opportunities for improving the user experience based on usability standards, the research highlighted specific points of failure at which we could risk:

  • Overwhelming the user with the complexity of fairness, from identifying potential attributes vulnerable to bias (such as proxies for identity groups) to prioritizing which fairness metrics matter, or, in contrast,

  • Oversimplifying fairness and giving users a false sense of confidence, such as setting default mathematical thresholds for acceptable disparities between groups or recommending a single fairness metric.

These key principles drove intentional friction into the design, and the usability testing evaluated whether that friction remained approachable; the sketch below shows why the second risk matters.
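As a concrete illustration of that second risk, here is a minimal sketch in Python (my own invented example, not Egaleco’s code) in which two common group fairness metrics tell different stories about the same hypothetical model output: the selection rates are identical across groups, while the true positive rates are not.

# Minimal sketch (illustrative only, not Egaleco's implementation):
# two common group fairness metrics computed on the same hypothetical
# binary classifier output can point in different directions.

def selection_rate(pred):
    # Share of cases that received the favorable outcome (1).
    return sum(pred) / len(pred)

def true_positive_rate(pred, true):
    # Share of truly positive cases that the model correctly selected.
    selected_positives = [p for p, t in zip(pred, true) if t == 1]
    return sum(selected_positives) / len(selected_positives)

# Invented predictions (1 = favorable outcome) and true labels for two groups.
group_a_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
group_a_true = [1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
group_b_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
group_b_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Demographic parity compares selection rates and sees no gap at all...
dp_gap = abs(selection_rate(group_a_pred) - selection_rate(group_b_pred))
# ...while equal opportunity compares true positive rates and sees a large one.
eo_gap = abs(true_positive_rate(group_a_pred, group_a_true)
             - true_positive_rate(group_b_pred, group_b_true))

print(f"Demographic parity gap: {dp_gap:.2f}")  # 0.00
print(f"Equal opportunity gap:  {eo_gap:.2f}")  # 0.67

A tool that recommended only demographic parity, or applied a default threshold to it, would declare this example unproblematic, while a user prioritizing equal opportunity would reach the opposite conclusion.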

Iterative Usability Testing

The usability tests elicited actionable feedback ranging from discrete interface adjustments, such as placement of the progress bar, to how much flexibility the tool should give users in selecting how to measure fairness. The following highlights a few key takeaways and the design decisions based on them.

Communicating Limits and Constraints

  • Challenge: The foundational research highlighted the importance of communicating the limits of what our tool could do and other steps and stakeholders that should be involved to ensure fairness in machine learning.

  • Solution: In the first design iteration, we included a “Notice and Consent” page proposed by our policy colleagues that asked users to acknowledge that their model used data that was ethically collected.

  • Evaluation: Most participants shared that they would not expect to have access to the information requested and would instead assume that someone earlier in the process was responsible for this step in ethical research. Further, the suggestion that they consult a research ethicist or data privacy expert felt alienating to some who would not have access to such a person. As a result, participants began their journey in the tool forced to acknowledge understanding of something they did not understand. The tool aims to support an honest reckoning with what users are and are not doing to make their models fairer, so we concluded that forcing acknowledgement of something likely outside our target users’ scope would inhibit honesty from the start. By emphasizing just this element of ethical research, the design may also have been undercutting other aspects of fairness that should be considered between informed consent and the tool’s assessment of model output.

  • Iteration: In the next iteration of the tool, we dropped the “Notice and Consent” page but acknowledged fairness considerations from the data generation phase in a “Limitations” carousel.

  • Evaluation: In the second round of usability testing, participants noted the constraints of the tool and expressed comfort continuing.

Fairness Metric Selection

  • Challenge: The background and foundational research highlighted the need for additional support in prioritizing the best fairness metrics for a model’s use case.

  • Solution: In the first design iteration, we included three different features to support metric prioritization, offering guidance with varying levels of required effort: a basic question-and-answer form, a robust questionnaire with visualizations and examples, and a flowchart.

  • Evaluation: While we thought question prompts in both designs would help users grapple with the context of their models, participants were unable to answer the questions as intended (for the sample model provided) and lacked context on what purpose the questions served as they were answering them. The flowchart provided more context, but participants did not see it unless prompted and felt the information would be overwhelming without understanding the question prompts and the menu of fairness metrics the tool offers. In the end, all five participants were unable to effectively prioritize fairness metrics for their use case.

  • Iteration: In the next round of design, we maintained both the questionnaire and the flowchart, but rather than forcing users through the questionnaire, we offered both as optional resources alongside the menu of available metrics. We also made the flowchart interactive, so the information would be less overwhelming. Finally, we workshopped the questionnaire language to simplify and clarify the questions and answer options. (A sketch of how such a questionnaire might map answers to candidate metrics follows this list.)

  • Evaluation: In the second round of testing, participants more quickly grasped the questionnaire’s value as a shortcut once they saw the quantity of metrics available. Revisions to the language and visualizations in the questionnaire also made it easier to understand the questions and correctly select answers. Four of the five participants completed the questionnaire and prioritized the best metrics for their use case.
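To make metric prioritization concrete, here is a rough sketch of the kind of mapping such a questionnaire can support. The questions, answer options, and metric menu below are my own illustrative assumptions, not Egaleco’s actual questionnaire content; the point is only that a few use-case answers can narrow a long menu of fairness metrics to a prioritized shortlist rather than a single “right” answer.

# Hypothetical sketch only: the questions, answers, and metric menu below are
# illustrative assumptions, not Egaleco's actual questionnaire content.
# A few use-case answers narrow a long menu of fairness metrics to a
# prioritized shortlist the user can then inspect and choose among.

METRIC_MENU = {
    "demographic parity",
    "equal opportunity",
    "equalized odds",
    "predictive parity",
}

def shortlist_metrics(intervention_type, costlier_error):
    # intervention_type: "assistive" (e.g., flags patients for extra care)
    #                    or "punitive" (e.g., denies a benefit).
    # costlier_error:    "false negatives" or "false positives".
    candidates = set(METRIC_MENU)
    if intervention_type == "assistive" and costlier_error == "false negatives":
        # Missing truly positive cases is the key harm, so prioritize metrics
        # that compare error rates for positive cases across groups.
        candidates &= {"equal opportunity", "equalized odds"}
    elif intervention_type == "punitive" and costlier_error == "false positives":
        # Wrongly flagging people is the key harm.
        candidates &= {"predictive parity", "equalized odds"}
    return candidates

# Example: an assistive health model where missed cases are the main concern.
print(sorted(shortlist_metrics("assistive", "false negatives")))
# ['equal opportunity', 'equalized odds'] -- a shortlist, not a single "right" metric.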

Engaging in Uncertainty

  • Challenge: Is the model fair? The foundational interviews reinforced that there is not a simple yes or no answer. Rather, we wanted to encourage users to grapple with harms and aim to make their model fairer. Our goal was to support users in making and documenting the decisions that define fairness for their model and in taking responsibility for that definition.

  • Solution: While we expected users would want Egaleco to tell them whether their model is fair or not, we chose to avoid even including those terms in the metric output, the final report provided to users. Instead, we focused on supporting users in setting priorities for fairness through the metric selection questionnaire and documenting those decisions in the final report. We also included a disparity threshold slider that lets users experiment with how large a gap between groups they would consider fair (a sketch of this kind of threshold check follows this list).

  • Evaluation: Even after iterating on the questionnaire, some participants expressed uncertainty about the priorities for fairness they had set, including those who had selected responses effectively. While they chose responses as we intended, the questionnaire asks questions that are difficult to answer and may require additional research and consultation. The participants’ experience of successfully moving through the tool with a healthy dose of uncertainty may actually be ideal for capturing the complexity of fairness. Alongside that uncertainty, participants found that the metric output, the final report page, effectively flagged concerns. While some were disappointed that the tool did not offer a more concrete fair-or-unfair verdict, the report also made it easier to share the completed process with cross-functional stakeholders and work through the flagged concerns together.
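To make the disparity threshold slider concrete, below is a minimal sketch in Python (an illustration under my own assumptions, not Egaleco’s implementation) of the kind of check such a slider could drive: each group’s metric is compared to a reference group, and groups whose ratio falls below the user-chosen threshold are flagged as concerns in the report rather than the model being declared fair or unfair.

# Minimal sketch of a disparity-threshold check (illustration only, not
# Egaleco's implementation): compare each group's metric to a reference
# group and flag gaps beyond the user-chosen threshold.

def flag_disparities(metric_by_group, reference_group, threshold):
    # Returns a note per group rather than a fair/unfair verdict.
    reference = metric_by_group[reference_group]
    report = {}
    for group, value in metric_by_group.items():
        ratio = value / reference
        if group == reference_group:
            report[group] = "reference group"
        elif ratio < threshold:
            report[group] = f"flagged: ratio {ratio:.2f} is below threshold {threshold}"
        else:
            report[group] = f"within threshold (ratio {ratio:.2f})"
    return report

# Example: true positive rates per group with the slider set to 0.8.
tpr_by_group = {"Group A": 0.62, "Group B": 0.81, "Group C": 0.79}
for group, note in flag_disparities(tpr_by_group, "Group B", 0.8).items():
    print(group, "-", note)
# Group A is flagged (ratio 0.77); Group C stays within the threshold (0.98).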


Learnings

  • This project was the perfect opportunity to navigate how much I needed to learn about a technical space, machine learning, in which I had limited exposure at the start of the project, in order to answer research questions that could be helpful to both my UX colleagues and our technical ML fairness leads. With limited time, I had to learn enough context to communicate effectively with our ML fairness expert and practitioner participants. In many ways, it reminded me of my experience conducting cross-cultural research across countries, but in this case I was integrating into ML practitioner culture.

  • This was the first time I was the only researcher. I vividly remember posing a question about research scope to our cross-functional team at the start of the project and registering that it was my call to make. Throughout the project, I learned to take ownership of the decisions I was best equipped to make as the lead researcher while providing opportunities for input from my colleagues as appropriate.

  • If I could do the project again, I would allocate more time and energy to advocating for research findings and collaborating to translate them into design decisions. The process instilled in me the importance of getting to know one’s stakeholders and gauging how specifically they want a researcher to provide recommendations. For example, in our foundational research the importance of humanizing the data and focusing on the stories of the people impacted emerged as critical, but it was secondary to designing a product that met all of the functional product requirements. I wish I had provided more support and ideas inspired by the research to incorporate that finding into the prototypes, especially given the tight time constraints of the semester my design colleagues were working within.