
Responsible and Trustworthy Data Science - Evaluation framework

The evaluation framework below is the result of the participatory work initiated in the spring of 2019 by Labelia Labs (ex- Substra Foundation) and ongoing since then. It is based on the identification of the risks that we are trying to prevent by aiming for a responsible and trustworthy practice of data science, and best practices to mitigate them. It also brings together for each topic technical resources that can be good entry points for interested organisations.

Last update: first half of 2023.

Evaluation framework to assess the maturity of an organisation

The evaluation is composed of the following 6 sections:


Section 1 - Protecting personal or confidential data and complying with regulatory requirements

[Data privacy and regulatory compliance]

The use of personal or confidential data carries the risk of exposure of such data, which can have very detrimental consequences for the producers, controllers or subjects of such data. Particularly in data science projects, they must therefore be protected and the risks of their leakage or exposure must be minimised. Additionally, AI models themselves can be attacked and must be protected. Finally, regulatory requirements specific to AI systems must be identified and known, and the data science activities of the organisation must be compliant.

[⇧ back to the list of sections]
[⇩ next section]


Q1.1 : Applicable legislation and contractual requirements - Identification
With regard to personal or confidential data, the legal, statutory, regulatory and contractual requirements in force and concerning your organisation are:

R1.1 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)

  • 1.1.a Not yet identified
  • 1.1.b Partially identified or in the process of identification
  • 1.1.c Identified
  • 1.1.d Identified and known by our collaborators
  • 1.1.e Identified, documented and known by our collaborators
Expl1.1 :

It is crucial to put in place processes to know and follow the evolution of applicable regulations (very specific in certain fields, for example in the banking sector), and to document the approaches and choices made to ensure compliance for each data science project. Interesting example(s): Welfare surveillance system violates human rights, Dutch court rules.

Resources1.1 :

Q1.2 : Applicable legislation and contractual requirements - Compliance approach
In order to meet these requirements, the approach adopted by your organisation is:

R1.2 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)

  • 1.2.a Informal, based on individual responsibility and competence
  • 1.2.b Formalized and accessible to all collaborators
  • 1.2.c Formalized and known by collaborators
  • 1.2.d Formalized, known by employees, documented for each processing of personal or confidential data
Expl1.2 :

The point is to examine how personal or confidential data are managed (storage, access, transfer, protection, responsibilities...) and to document the choices made.


Q1.3 : Applicable legislation and contractual requirements - Regulatory surveillance
Is a regulatory surveillance process in place, either internally or via a specialised service provider, to find out about applicable changes that have an impact on your organisation?

R1.3 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)

  • 1.3.a We do not really monitor the regulatory environment
  • 1.3.b We keep an informal watch, each employee sends back information via internal communication channels
  • 1.3.c We have a formal surveillance, with identified collaborators in charge and a documented process
Expl1.3 :

In addition to identifying regulations and compliance approaches, it is important to set up a surveillance process to know and follow the evolution of applicable regulations (which can be very specific in certain sectors). Interesting example(s): Welfare surveillance system violates human rights, Dutch court rules.

Resources1.3 :

Q1.4 : Applicable legislation and contractual requirements - Auditing and certification
Has the organisation's compliance with personal and confidential data requirements been audited and is it recognised by a certification, label or equivalent?

R1.4 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)

  • 1.4.a Yes
  • 1.4.b No
  • 1.4.c We are currently preparing an upcoming audit or certification of our organisation's compliance with personal and confidential data requirements
  • 1.4.d Not at the organization level, but it is the case for at least one project
Expl1.4 :

In many sectors there are specific compliance requirements. It is generally possible to formalise an organisation's compliance through certification or a specialised audit, or by obtaining a label (e.g. AFAQ "Protection des données personnelles", ISO 27701).


Q1.5 : Data minimisation principle
In data science projects, the data minimisation principle should guide the collection and use of personal or confidential data. How is it implemented in your organisation?

R1.5 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)
(Specific risk domain: use of personal or confidential data)

  • 1.5.a We take care not to use any personal or confidential data. We are not concerned by this risk area
  • 1.5.b We need to use personal or confidential data in certain projects and the data minimisation principle is then systematically applied
  • 1.5.c Employees are aware of the data minimisation principle and generally apply it
  • 1.5.d The "who can do the most can do the least" reflex with regard to data still exists here and there within our organisation. In some projects, we keep datasets that are much richer in personal and confidential data than what is strictly useful to the project
  • 1.5.e Employees are aware of the data minimisation principle, but it is not applied as a general standard. However, we pay particular attention to implementing personal data-related risk mitigation measures (e.g. pseudonymising some features with identifiers and a separate correspondence table, splitting datasets into multiple tables kept apart)
Expl1.5 :

The data minimisation principle is sometimes also referred to as "privacy by design". It is one of the pillars of the GDPR in the European Union.
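As an illustration of the risk mitigation measure mentioned in 1.5.e, the sketch below shows pseudonymisation with a separate correspondence table. It is a minimal example with hypothetical column names and data, not a complete anonymisation scheme:

```python
import uuid

import pandas as pd

def pseudonymise(df, column):
    # Build a correspondence table: one random pseudonym per distinct value.
    mapping = {value: uuid.uuid4().hex for value in df[column].unique()}
    out = df.copy()
    out[column] = out[column].map(mapping)
    # The correspondence table must be stored separately from the working
    # dataset, with stricter access controls.
    correspondence = pd.DataFrame(
        {"pseudonym": list(mapping.values()), column: list(mapping.keys())}
    )
    return out, correspondence

patients = pd.DataFrame({"name": ["Alice", "Bob", "Alice"],
                         "measurement": [1.2, 3.4, 5.6]})
pseudo, table = pseudonymise(patients, "name")
```

Note that pseudonymised data generally remains personal data under the GDPR, since re-identification is possible via the correspondence table.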


The following elements within this section apply only to organisations that did not select the first response to R1.5. Organisations not concerned are therefore invited to move on to Section 2.


Q1.6 : Project involving new processing of personal or confidential data
(Condition: R1.5 <> 1.5.a)
For each processing of personal or confidential data required in the framework of a data science project within your organisation:

R1.6 :
(Type: multiple responses possible)
(Select all the answer items that correspond to practices in your organisation)

  • 1.6.a We elaborate a Privacy Impact Assessment (PIA)
  • 1.6.b We implement data protection measures (in particular concerning the transfer, storage and access to the data concerned)
  • 1.6.c We contractualise relations with suppliers and customers and the responsibilities that arise from them
  • 1.6.d We have not yet set up an organised approach to these subjects
Expl1.6 :

The Privacy Impact Assessment (PIA) is a method for assessing the impact of a data processing, similar to traditional risk assessment methods. In certain cases, for example where a processing operation presents high risks to the rights and freedoms of natural persons, the GDPR makes it obligatory to carry out a PIA before the processing operation is carried out.


Q1.7 : Machine Learning security - Knowledge level
(Condition: R1.5 <> 1.5.a)
Machine Learning security (ML security) is a constantly evolving field. In some cases, AI models learned from confidential data may reveal elements of that confidential data (see articles cited in resources). Within your organisation, the general level of knowledge of collaborators working on data science projects about vulnerabilities related to ML models and the techniques to mitigate them is:

R1.7 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)

  • 1.7.a Complete beginner
  • 1.7.b Basic
  • 1.7.c Confirmed
  • 1.7.d Expert
Expl1.7 :

The state of the art in ML security is constantly evolving. While data scientists are now generally familiar with the membership inference attack (see proposed resources), new attacks are published regularly. While it is impossible to guard against all vulnerabilities at all times, it is crucial to be aware of them and to keep a watch on them. The article Demystifying the Membership Inference Attack is for example an interesting entry point in the context of sensitive data.

Resources1.7 :

Q1.8 : Machine Learning security - Implementation
(Condition: R1.5 <> 1.5.a)
Still on the subject of vulnerabilities related to ML models and techniques to mitigate them:

R1.8 :
(Type: multiple responses possible)
(Select all the answer items that correspond to practices in your organisation)

  • 1.8.a We keep a technical watch on the main attacks and measures to mitigate them
  • 1.8.b Employees receive regular information and training to help them develop their skills in this area
  • 1.8.c In some projects, we implement specific techniques to reduce the risks associated with the models we develop (for example: differential privacy, distillation, etc.)
  • 1.8.d On each project, the vulnerabilities that apply to it and the techniques implemented are documented (e.g. in the lifecycle documentation of each model, see Section 4 and Element 4.1 for more information on this concept)
  • 1.8.e We have not yet set up an organised approach to these subjects
Expl1.8 :

The state of the art in ML security is constantly evolving. While data scientists are now generally familiar with the membership inference attack (see proposed resources), new attacks are published regularly. While it is impossible to guard against all vulnerabilities at all times, it is crucial to be aware of them and to keep a watch on them. The article Demystifying the Membership Inference Attack is for example an interesting entry point in the context of sensitive data.

Depending on the level of risk and sensitivity of the projects, certain technical approaches to guard against them will be selected and implemented. It is important to follow the evolution of research and state-of-the-art practices, and to document the choices made, to constitute a model lifecycle documentation.

Resources1.8 :

Q1.9 : Notification of safety incidents to the regulatory authorities
(Condition: R1.5 <> 1.5.a)
In the event that a model the organisation has developed is used by or accessible to one or more external stakeholders, and a new vulnerability is published, that vulnerability may apply to the model and thus create a risk of exposure of personal or confidential data:

R1.9 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)

  • 1.9.a We have not yet put in place a procedure for such cases
  • 1.9.b We have a process describing the course of action in such cases
  • 1.9.c We have a process describing the course of action in such cases, which references the authorities to whom we must report
  • 1.9.d We have a process describing the course of action in such cases, which references the authorities to whom we must report, and which includes communication to the stakeholders of whom we have contact details
Expl1.9 :

In some sectors there are obligations to report safety incidents to the regulatory authorities (e.g. in France: CNIL, ANSSI, ARS, etc.). An interesting entry point: Notifications of safety incidents to regulatory authorities: how to organise and who to contact on the CNIL website.



Section 2 - Preventing bias, developing non-discriminatory models

[Biases and discrimination]

The use of AI models learned from historical data can be counterproductive when historical data are contaminated by problematic phenomena (e.g. quality of certain data points, non-comparable data, social phenomena undesirable due to the time period, etc.). A key challenge for responsible and trustworthy data science is to respect the principle of diversity, non-discrimination and equity (described for example in section 1.5 of the EU Ethics Guidelines for Trustworthy AI). It is therefore essential to question this risk and to study the nature of the data used, the conditions under which they were produced and collected, and what they represent. Among other things, in some cases a specification of the equity sought between populations must also be defined. The equity of a model can be defined in several ways that may be inconsistent with each other, and the interpretation of performance scores must therefore be made within the framework of one of these definitions.

[⇧ back to the list of sections]
[⇩ next section]


Q2.1 : Gathering and assembling data samples into training and validation datasets
Often an initial phase of data science projects consists in gathering and assembling data samples into training and validation datasets. In many cases this presents difficulties and is a source of risks. Regarding this particular activity, has your organisation defined, documented and operationalised an approach or method taking into account in particular the following:

R2.1 :
(Type: multiple responses possible)
(Select all the answer items that correspond to practices in your organisation)

  • 2.1.a We operate informally on this subject and rely on the practices of each collaborator involved
  • 2.1.b Our approach includes methods to prevent poisoning attacks when collecting and gathering data samples
  • 2.1.c Our approach includes methods to check and make sure when necessary that datasets include samples of rare events
  • 2.1.d Our approach includes methods to complete missing values in datasets
  • 2.1.e Our approach includes methods to handle erroneous or atypical data samples values
Expl2.1 :

Obtaining and preparing datasets is a core activity in every data science project. Each data point can have an impact on the learning, and it is thus crucial to define and implement a conscious, coherent, concerted approach to mitigate the risk of learning and testing on problematic datasets.

Resources2.1 :
  • (Technical guide) Tour of Data Sampling Methods for Imbalanced Classification
  • (Software & Tools) Pandas Profiling: creates HTML profiling reports from pandas DataFrame objects. The pandas df.describe() function is useful but too basic for extensive exploratory data analysis; pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis
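The practices listed in 2.1.c, 2.1.d and 2.1.e can be illustrated with a minimal pandas sketch. The data is a toy example, and median imputation plus naive random oversampling are only two of many possible strategies (see the resources above for more elaborate sampling methods):

```python
import pandas as pd

# Toy dataset with one missing value and a rare positive class.
df = pd.DataFrame({
    "feature": [1.0, 2.0, None, 4.0, 5.0, 6.0],
    "label":   [0,   0,   0,    0,   0,   1],
})

# Complete missing values with the column median (one possible strategy).
df["feature"] = df["feature"].fillna(df["feature"].median())

# Naive random oversampling of the rare class to rebalance the training set.
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]
balanced = pd.concat(
    [df, minority.sample(n=len(majority) - len(minority),
                         replace=True, random_state=0)],
    ignore_index=True,
)
```

Whatever strategy is chosen, documenting it (which values were imputed, which samples were duplicated or reweighted) is what makes the resulting datasets auditable later.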

Q2.2 : Analysis of the training data
Within data science projects and when developing training datasets, reflection and research on problematic phenomena (e.g. quality of certain data points, data that are not comparable due to recording tools or processes, social phenomena that are undesirable due to time, context, etc.) can be crucial to prevent bias that undermines the principle of non-discrimination, diversity and equity. Your organisation:

R2.2 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)

  • 2.2.a Operates informally on this subject and relies on the practices of each collaborator involved
  • 2.2.b Does not have a documented approach to the subject, but the collaborators involved are trained on the risks and best practices on the subject
  • 2.2.c Has a documented approach that is systematically implemented
Expl2.2 :

The point is to make sure these questions are considered, and therefore to examine the training data, the way in which it was produced, etc. For example:

  • sensor or capture bias, e.g. if the sensors used to acquire and record data points are not identical throughout the capture process and lifecycle, or between controlled training data and real data;
  • paying special attention to data labels and annotations: how were they generated? What level of quality and reliability? Who are the authors of these annotations or labels? Labels have to be coherent with the modelling objectives and the intended domain of use of the model.
Resources2.2 :

Q2.3 : Evaluation of the risk of population bias and discrimination against certain social groups
In the context of data science projects, the nature of the project, the data used for the project and/or the thematic environment of the project can foster a risk of population bias against certain social groups (gender, origin, age, etc.). Evaluating first for each project if it is subject or not to such a risk seems key (in which case mitigation measures can be then contemplated). On that topic, your organisation:

R2.3 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)
(Specific risk domain: discrimination against certain social groups)

  • 2.3.a Operates informally and relies on the practices of each collaborator involved to evaluate if there is a risk
  • 2.3.b Does not have a documented approach to the subject, but the collaborators involved are knowledgeable and trained on the risks and best practices on the subject
  • 2.3.c Has a documented approach that is systematically implemented to evaluate this type of risk
Expl2.3 :

Configurations with risks of potential discriminations against social groups are particularly sensitive for the organisation and its counterparts. It requires special attention and the use of specific methodologies. In certain cases it is obvious if this risk has to be considered or not (e.g. projects on behavioral data on a population of users or customers, vs. projects on oceanographic or astronomical data), whereas in some cases it might be less obvious. It is therefore important to consider the question for each project.


Q2.4 : Preventing population bias and discrimination
(Condition: R2.3 <> 2.3.b)
In cases where the AI models your organisation develops are used in thematic environments where there is a risk of population bias or discrimination against certain social groups (gender, origin, age, etc.):

R2.4 :
(Type: multiple responses possible)
(Select all the answer items that correspond to practices in your organisation)
(Specific risk domain: discrimination against certain social groups)

  • 2.4.a We are not involved in cases where AI models are used in thematic environments with risks of population bias or discrimination against certain social groups (gender, origin, age, etc.) | (Concerned / Not concerned)
  • 2.4.b We pay special attention to the identification of protected attributes and their possible proxies (e.g. studying one by one the variables used as model inputs to identify the correlations they might have with sensitive data)
  • 2.4.c We carry out evaluations on test data from different sub-populations in order to identify possible problematic biases
  • 2.4.d We select and implement one or more justice and equity measure(s) (fairness metrics)
  • 2.4.e We use data augmentation or re-weighting approaches to reduce possible biases in the data sets
  • 2.4.f The above practices that we implement are duly documented and integrated into the model lifecycle documentation of the models concerned
  • 2.4.g We have not yet put in place any such measures
Expl2.4 :

It is a question of systematically questioning, for each data science project and according to the objective and target use of the model one wants to develop, the features that may directly or indirectly be the source of a risk of discriminatory population bias. The term "protected attribute" or "protected variable" is used to refer to attributes whose values define sub-populations at risk of discrimination. A note on the use of synthetic data, data augmentation or re-weighting approaches to reduce possible biases in datasets: when such techniques are used, it is important to make them explicit, otherwise there is a risk of losing information on how a model was developed.
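Practices 2.4.b, 2.4.c and 2.4.d can be sketched as follows. The data is a toy example; a simple correlation screen and the demographic parity difference are only one possible choice of proxy check and fairness metric among several, and the right choice depends on the project:

```python
import pandas as pd

applications = pd.DataFrame({
    "gender":   ["F", "M", "F", "M", "F", "M"],
    "zip_code": [75001, 93200, 75001, 93200, 93200, 75001],
    "income":   [40, 35, 42, 30, 28, 50],
    "approved": [1, 0, 1, 0, 0, 1],
})

# 1) Screen numerical features for association with the protected attribute
#    (categorical features would call for e.g. Cramer's V instead).
protected = (applications["gender"] == "F").astype(int)
proxy_screen = applications[["zip_code", "income"]].corrwith(protected).abs()

# 2) Demographic parity difference: the gap in positive-decision rates
#    between the sub-populations defined by the protected attribute.
rates = applications.groupby("gender")["approved"].mean()
dp_difference = abs(rates["F"] - rates["M"])
```

Different fairness metrics (demographic parity, equalised odds, predictive parity...) can be mutually incompatible, which is why the metric chosen, and why it was chosen, should be documented per project.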

Resources2.4 :

Q2.5 : Links between modelisation choices and bias
(Condition: R2.3 <> 2.3.b)
Recent work has shown the role that modeling and learning choices can play in the formation of discriminatory bias. Differential privacy, compression, the choice of the learning rate, early stopping mechanisms for example can have disproportionate impacts on certain subgroups. Within your organisation, the general level of knowledge of collaborators working on data science projects on this topic is:

R2.5 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)
(Specific risk domain: discrimination against certain social groups)

  • 2.5.a We are not involved in cases where AI models are used in thematic environments with risks of discrimination against certain social groups (gender, origin, age, etc.) | (Concerned / Not concerned)
  • 2.5.b Complete beginner
  • 2.5.c Basic
  • 2.5.d Confirmed
  • 2.5.e Expert
Expl2.5 :

While the datasets used to train and evaluate a model require particular attention to prevent discriminatory biases, recent work shows that modelling choices have to be taken into account too. The article "Moving beyond “algorithmic bias is a data problem”" suggested in the resources synthesizes very well how the learning algorithm, the model structure, the use or not of differential privacy, compression, etc. can have consequences on the fairness of a model. Extracts:

  • A key reason why model design choices amplify algorithmic bias is because notions of fairness often coincide with how underrepresented protected features are treated by the model
  • [...] design choices to optimize for either privacy guarantees or compression amplify the disparate impact between minority and majority data subgroups
  • [...] the impact of popular compression techniques like quantization and pruning on low-frequency protected attributes such as gender and age and finds that these subgroups are systematically and disproportionately impacted in order to preserve performance on the most frequent features
  • [...] learning rate and length of training can also disproportionately impact error rates on the long-tail of the dataset. Work on memorization properties of deep neural networks shows that challenging and underrepresented features are learnt later in the training process and that the learning rate impacts what is learnt. Thus, early stopping and similar hyper-parameter choices disproportionately and systematically impact a subset of the data distribution.

These topics require strong expertise and few practitioners are familiar with them yet. In the context of this evaluation element, the recommendation is to learn about them, become aware of the complex trade-offs they imply, consider them during concrete projects rather than setting them aside, and follow how the state of the art evolves and what best practices emerge.

Resources2.5 :


Section 3 - Assessing model performance rigorously

[Performance evaluation]

The performance of the models is crucial for their adoption in products, systems or processes. Performance evaluation must therefore be rigorous.

[⇧ back to the list of sections]
[⇩ next section]


Q3.1 : Separation of test datasets
In data science projects and when developing test datasets, it is of utmost importance to ensure non-contamination by training data. Your organisation:

R3.1 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)

  • 3.1.a Operates informally on this subject and relies on the competence and responsibility of the collaborators involved
  • 3.1.b Has a documented and systematically implemented approach to isolating test datasets
  • 3.1.c Uses a tool for versioning and tracing the training and test datasets used, thus enabling the non-contamination of test data to be checked or audited at a later stage
  • 3.1.d The train-test split technical choices implemented are evaluated, documented and integrated into the model lifecycle documentation of the concerned models
Expl3.1 :

Ensuring that training and test datasets are kept separate is a principle known and mastered by most organisations. It can however be tricky in some particular configurations (e.g. continuous learning, privacy-preserving federated learning...).
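One possible way to make non-contamination auditable, in the spirit of 3.1.c, is to keep row-level fingerprints of each dataset version. The sketch below is deliberately simplified; real pipelines would typically rely on dedicated data-versioning tools rather than a hand-rolled hash:

```python
import hashlib

import pandas as pd

def fingerprints(df):
    # One stable hash per row; storing these alongside each dataset version
    # makes later non-contamination checks and audits possible without
    # keeping a copy of the raw data.
    rows = df.astype(str).agg("|".join, axis=1)
    return {hashlib.sha256(row.encode()).hexdigest() for row in rows}

train = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})
test = pd.DataFrame({"x": [4, 5], "y": [1, 0]})

# An empty intersection means no test row appears verbatim in the train set.
contaminated = not fingerprints(train).isdisjoint(fingerprints(test))
```

Exact-match hashing only catches verbatim duplicates; near-duplicates (e.g. the same record with slightly different formatting) require fuzzier checks.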


Q3.2 : Privacy-preserving distributed learning projects
In the case of data science projects based on distributed or federated learning on multiple datasets and whose confidentiality must be preserved:

R3.2 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)
(Specific risk domain: federated learning on sensitive data)

  • 3.2.a We do not participate in privacy-preserving distributed learning projects | (Concerned / Not concerned)
  • 3.2.b We master and implement approaches to develop test datasets in such a way that there is no cross-contamination between training and test data from different partners
  • 3.2.c At this stage we do not master the methods for developing test datasets in such a way that there is no cross-contamination between training and test data from the different partners
Expl3.2 :

In this type of distributed learning project under conditions where the data is kept confidential, the question arises of how to compose a test dataset, making sure that it does not also appear in the training dataset (e.g. at another partner's premises).
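A much-simplified sketch of a cross-contamination check between partners: each partner hashes its record identifiers locally with a shared salt, and only the hashes are exchanged. The identifiers and salt below are hypothetical, and production setups would rather use private set intersection protocols, since salted hashes of low-entropy identifiers remain vulnerable to brute-force guessing:

```python
import hashlib

def record_hashes(record_ids, salt):
    # Each partner hashes its record identifiers locally; only the hashes
    # are exchanged, never the raw confidential records.
    return {hashlib.sha256((salt + rid).encode()).hexdigest()
            for rid in record_ids}

SHARED_SALT = "shared-secret-salt"  # hypothetical value agreed between partners
partner_a_training = record_hashes(["patient-001", "patient-002"], SHARED_SALT)
partner_b_test = record_hashes(["patient-002", "patient-003"], SHARED_SALT)

# Any overlap flags candidate test records that also appear in a
# partner's training data.
cross_contamination = partner_a_training & partner_b_test
```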

Resources3.2 :

Q3.3 : Analysis of validation and test data
Within data science projects and when developing validation or test datasets, reflection and research on problematic phenomena (e.g. quality of certain data points, data that are not comparable due to recording tools or processes, social phenomena that are undesirable due to time, context, etc.) can be crucial for the meaning of performance scores. Your organisation:

R3.3 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)

  • 3.3.a Operates informally on this subject and relies on the practice of each collaborator involved
  • 3.3.b Does not have a documented approach to the subject, but the collaborators involved are trained on the risks and best practices on the subject
  • 3.3.c Has a documented approach that is systematically implemented
Expl3.3 :

The use of AI models that have been validated and tested on historical data can be counterproductive when the historical data in question is contaminated by problematic phenomena. It seems essential to question this risk and to study the nature of the data used, the conditions under which they were produced and assembled, and what they represent.


Q3.4 : Performance validation
Does your organisation implement the following approaches:

R3.4 :
(Type: multiple responses possible)
(Select all the answer items that correspond to practices in your organisation)

  • 3.4.a When developing a model, we choose the performance metric(s) prior to actually training the model, from among the most standard metrics possible
  • 3.4.b The implementation of robustness metrics is considered and evaluated for each modelling project, and applied by default in cases where the input data may be subject to fine-grain alterations (e.g. images, sounds)
  • 3.4.c The above practices that we implement are documented and integrated into the model lifecycle documentation of the concerned models, including the performance metrics chosen
  • 3.4.d We have not yet introduced any such measures
Expl3.4 :

On choosing metrics upstream, see for example the risk of p-hacking / data dredging. On robustness, an intuitive definition is that a model is robust when its performance is stable when the input data is disturbed. For more information see the technical resources indicated.
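A simple robustness proxy in the spirit of 3.4.b measures how stable predictions are under small input perturbations. The toy classifier and noise scale below are illustrative only; more rigorous evaluations would use adversarial perturbations rather than random noise:

```python
import numpy as np

def perturbation_stability(predict, X, noise_scale=0.01, n_trials=20, seed=0):
    # Fraction of predictions left unchanged by small Gaussian input noise:
    # a simple proxy for robustness to fine-grain alterations.
    rng = np.random.default_rng(seed)
    baseline = predict(X)
    per_trial = [
        np.mean(predict(X + rng.normal(scale=noise_scale, size=X.shape))
                == baseline)
        for _ in range(n_trials)
    ]
    return float(np.mean(per_trial))

# Toy classifier thresholding the first feature; two of the four points
# sit close to the decision threshold and may flip under noise.
predict = lambda X: (X[:, 0] > 0.5).astype(int)
X = np.array([[0.1], [0.9], [0.49], [0.51]])
score = perturbation_stability(predict, X)
```

A score close to 1 means predictions barely move under perturbation; points near the decision boundary are the ones that drag the score down.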

Resources3.4 :

Q3.5 : Monitoring model performance over time
In cases where AI models developed by your organisation are used in production systems:

R3.5 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)
(Specific risk domain: use of AI models in production systems)

  • 3.5.a The models we develop are not used in production systems | (Concerned / Not concerned)
  • 3.5.b Performance is systematically re-evaluated when the model is updated
  • 3.5.c Performance is systematically re-evaluated when the context in which the model is used evolves, which may create a risk on the performance of the model due to the evolution of the input data space
  • 3.5.d The distribution of input data is monitored, and performance is regularly re-evaluated on the basis of updated test data
  • 3.5.e Random checks are carried out on predictions to check their consistency
  • 3.5.f We do not systematically set up this type of measure
Expl3.5 :

Even with a stable model, there is a risk that after a certain time the input data no longer matches the target distribution (population & distribution): for example, a variable that is no longer filled in at the same frequency as before by users of an information system. It is therefore necessary to regularly re-evaluate the performance of a model in its context of use. Monitoring the performance of models over time is also particularly important in cases of continuous learning, where there is a risk of model degeneration.
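Monitoring the distribution of input data, as in 3.5.d, can be sketched with the Population Stability Index (PSI), a metric commonly used for drift monitoring. The data below is synthetic, and the frequently quoted alert threshold of ~0.2 is a rule of thumb, not a universal constant:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    # Bin edges come from the reference distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip live data into the reference range so outliers land in edge bins.
    actual = np.clip(actual, edges[0], edges[-1])
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) on empty bins.
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)  # input variable at training time
live = rng.normal(0.5, 1.0, 5000)       # same variable in production, shifted
drift_score = population_stability_index(reference, live)
```

In practice such a score would be computed per input variable at a regular cadence, with alerts feeding into the re-evaluation process described above.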

Resources3.5 :

Q3.6 : Decision making and ranges of indecision
For the definition of decision thresholds for models or automatic systems based on them, your organisation:

R3.6 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)

  • 3.6.a Operates informally on this subject, depending upon the collaborators involved
  • 3.6.b Has a documented approach that is systematically implemented
  • 3.6.c Takes into account the possibility of maintaining ranges of indecision in certain cases
  • 3.6.d The choices made for each model and implemented are documented and integrated into the lifecycle documentation of the models concerned.
Expl3.6 :

The study and selection of relevant decision thresholds for a given data science problem (threshold selection) is linked to the metrics selected. As discussed in the resources section of this evaluation issue, in some cases it may be worth considering the possibility of defining ranges of indecision.
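The indecision range mentioned in 3.6.c can be sketched as a three-way decision rule; the thresholds below are illustrative and would in practice be chosen from the metrics and risk analysis of the project:

```python
def decide(score, lower=0.4, upper=0.6):
    # Scores inside the indecision range [lower, upper] are routed to
    # human review instead of being forced into a binary decision.
    if score < lower:
        return "reject"
    if score > upper:
        return "accept"
    return "human_review"

decisions = [decide(s) for s in (0.12, 0.55, 0.91)]
```

Widening the range trades automation rate for fewer low-confidence automatic decisions; that trade-off is worth documenting in the model lifecycle documentation.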

Resources3.6 :

Q3.7 : Audits by independent third parties and verifiable claims
When your organization communicates on the results or performance of an AI system, and makes it a marketing and communication argument to its stakeholders:

R3.7 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)
(Specific risk domain: external communication on the performance of AI systems)

  • 3.7.a We do not communicate or do not need to communicate on the results or the performance of our AI systems, or do not use the results or performance of our AI systems as an argument to our stakeholders, we are not concerned by this assessment element | (Concerned / Not concerned)
  • 3.7.b We communicate on the results or the performance of our AI systems and rely on them for our development without first having our work audited by an independent third party, without making evidence available
  • 3.7.c We have our work audited by an independent third party, or we make evidence available, before communicating our results and relying on them with our stakeholders
Expl3.7 :

Developing an AI model, and determining a meaningful and reliable benchmark performance measure, is a complex challenge. It is therefore often difficult for an organisation to assert that it has achieved excellent results and to claim them with certainty. Moreover, it can be even more difficult to make evidence publicly available without revealing valuable information about the organisation's intellectual property and the value of the work carried out. In such cases, it is recommended to have an audit carried out by an independent third party (e.g. on security, privacy, fairness, reliability), in order to secure the results the organisation wishes to claim.

Resources3.7 :


Section 4 - Ensuring model reproducibility and establishing the chain of accountability

[Model documentation]

An AI model is a complex object that can evolve over time. Tracing the stages of its development and evolution allows one to create a model lifecycle documentation, which is a prerequisite for reproducing or auditing a model. Furthermore, using automatic systems based on models whose rules have been "learned" (and not defined and formalised) questions the way organisations operate. It seems essential to guarantee a clear chain of responsibility, of natural or legal persons, for each model.

[⇧ back to the list of sections]
[⇩ next section]


Q4.1 : Lifecycle end-to-end documentation of ML models
Ensuring the traceability of all steps of the development of an AI model enables building up a model lifecycle documentation. Within your organisation, a lifecycle documentation of models is fed and maintained within the framework of data science projects, throughout the phases of data collection, design, training, validation and exploitation of the predictive models:

R4.1 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)

  • 4.1.a At this stage we have not implemented any such approach
  • 4.1.b This information exists and is recorded so as not to be lost, but it may be scattered and is not versioned
  • 4.1.c This information is compiled in a single document which systematically accompanies the model
  • 4.1.d This information is compiled in a single document which systematically accompanies the model and is versioned
Expl4.1 :

This concept of "model lifecycle documentation" of a learned AI model can take the form, for example, of a reference document containing all the important choices and the entire history of model development (data used, pre-processing carried out, type of learning and model architecture, hyperparameters selected, decision thresholds, test metrics, etc.), and the internal processes organising this activity. In particular, it is useful to include the trade-offs that have been made and why (e.g. precision-specificity, performance-privacy or performance-computing cost trade-offs).
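As an illustration, such a lifecycle record can be as simple as a versioned JSON document stored alongside the model artefact; dedicated tools (MLflow, DVC, etc.) cover this in a more complete way. All field names and values below are hypothetical:

```python
import hashlib
import json

# illustrative lifecycle record for a hypothetical model
lifecycle_record = {
    "model_name": "churn_classifier",
    "version": "1.3.0",
    "training_data": {"source": "crm_extract_2023q1", "rows": 125000},
    "preprocessing": ["drop_missing_values", "standard_scaling"],
    "architecture": "gradient_boosting",
    "hyperparameters": {"n_estimators": 200, "max_depth": 4},
    "decision_threshold": 0.62,
    "test_metrics": {"auc": 0.87, "precision": 0.81},
    "trade_offs": "simpler model chosen to ease interpretability",
    "created_at": "2023-04-12",
}

payload = json.dumps(lifecycle_record, sort_keys=True, indent=2)
# a content hash makes the record tamper-evident once stored with the model
record_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
print(f"lifecycle record {record_id} ready to be versioned")
```

Committing this file to version control alongside the model artefact gives the versioning described in answer 4.1.d.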

Resources4.1 :
  • (Software & Tools) Substra Framework: an open source framework offering distributed orchestration of machine learning tasks among partners while guaranteeing secure and trustless traceability of all operations
  • (Software & Tools) MLflow: an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry
  • (Software & Tools) DVC: an Open-source Version Control System for Machine Learning Projects
  • (Software & Tools) DAGsHub: a platform for data version control and collaboration, based on DVC
  • (Software & Tools) Model lifecycle template: template for Data Scientists to help collect all the information in order to trace the lifecycle from end to end of a model, 2020, Joséphine Lecoq-Vallon
  • (Academic paper) System-Level Transparency of Machine Learning, 2022, Meta AI: System Cards aims to increase the transparency of ML systems by providing stakeholders with an overview of different components of an ML system, how these components interact, and how different pieces of data and protected information are used by the system

Q4.2 : Conditions and limitations for using a model
In the context of data science projects, the "conditions and limits of validity" of a model designed, trained and validated by the organisation:

R4.2 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)

  • 4.2.a Are not systematically documented; this relies on the practices of each collaborator involved
  • 4.2.b Are systematically made explicit and documented
  • 4.2.c Are versioned
  • 4.2.d Contain a description of the risks involved in using the model outside its "conditions and limits of validity"
  • 4.2.e The documents presenting these "conditions and limits of validity" systematically accompany the models throughout their life cycle
Expl4.2 :

The aim is to make explicit, and attach to the model, a description of the context of use for which it was designed and in which its announced performance is meaningful. This concept of "conditions and limits of validity" can take the form of a synthetic document or a specific section in the model lifecycle documentation.
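One possible way to operationalise these "conditions and limits of validity" is to encode them as machine-checkable bounds verified before each prediction. The feature names and bounds below are purely illustrative:

```python
# illustrative validity domain declared alongside a hypothetical model
VALIDITY = {
    "age": (18, 75),        # model validated on adult customers only
    "income": (0, 200000),  # in euros per year
}

def check_validity(features):
    """Return the list of features outside the declared validity domain."""
    violations = []
    for name, (lo, hi) in VALIDITY.items():
        value = features.get(name)
        if value is None or not (lo <= value <= hi):
            violations.append(name)
    return violations

print(check_validity({"age": 34, "income": 42000}))  # in-domain request
print(check_validity({"age": 16, "income": 42000}))  # out-of-domain: age
```

Out-of-domain requests can then be refused, flagged or routed to a human, instead of silently returning a prediction the model was never validated for.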

Resources4.2 :
  • (Academic paper) Model Cards for Model Reporting, M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, T. Gebru, January 2019
  • (Web article) Model Cards from Google is an open and scalable framework, and offers 2 examples: To explore the possibilities of model cards in the real world, we've designed examples for two features of our Cloud Vision API, Face Detection and Object Detection. They provide simple overviews of both models' ideal forms of input, visualize some of their key limitations, and present basic performance metrics.
  • (Web article) Model Cards for AI Model Transparency, Salesforce: examples of Model Cards used and published by Salesforce
  • (Software & Tools) AI FactSheets 360, an IBM Research project to foster trust in AI by increasing transparency and enabling governance: Increased transparency provides information for AI consumers to better understand how an AI model or service was created. This allows a consumer of the model to determine if it is appropriate for their situation. AI Governance enables an enterprise to specify and enforce policies describing how an AI model or service should be constructed and deployed.

Q4.3 : Analysis and publications of incidents reports
In data science projects, when unexpected behaviour of a model is observed:

R4.3 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)

  • 4.3.a At this stage we do not analyse the incidents or unexpected behaviour observed
  • 4.3.b We analyse incidents or unexpected behaviour encountered, but don't publish or share it
  • 4.3.c We analyse incidents or unexpected behaviour encountered and publish them when relevant (e.g. article, blog)
  • 4.3.d We get involved in clubs, networks or professional associations in the field of data science, and give feedback on incidents of unexpected behaviour that we observe
Expl4.3 :

Understanding or even mastering the behaviour of a learned AI model is a complex challenge. Lots of research is being done to develop methods and tools in this area, but much remains to be done. The sharing by practitioners of the unexpected incidents and behaviours they encounter contributes to the progress of the community.

Resources4.3 :

Q4.4 : Value chain and chain of accountability
In the case of data science projects where several actors, including actors internal to the organisation (teams, departments, subsidiaries), are involved throughout the value and accountability chains:

R4.4 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)
(Specific risk domain: roles and responsibilities in data science projects are divided up among multiple actors)

  • 4.4.a Within our organisation, data science projects are carried out end-to-end by autonomous teams, including the elaboration of datasets and the exploitation of models for its own account. Consequently, for each project, an autonomous team is solely responsible | (Concerned / Not concerned)
  • 4.4.b We systematically identify the risks and responsibilities of each of the internal and external stakeholders with whom we work
  • 4.4.c We systematically enter into contracts with upstream (e.g. data suppliers) and downstream (e.g. customers, model-using partners) players
  • 4.4.d We do not systematically implement this type of measure
Expl4.4 :

It is important to ensure that organisations upstream and downstream of the chain identify and take responsibility for their segments of the value chain.


Q4.5 : Subcontracting of all or part of the data science activities
Data science activities subcontracted to a third party organisation(s) are subject to the same requirements your organisation applies to itself:

R4.5 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)
(Specific risk domain: subcontracting of data science activities)

  • 4.5.a Not concerned, we do not subcontract these activities | (Concerned / Not concerned)
  • 4.5.b Yes, our responses to this evaluation take into account the practices of our subcontractors
  • 4.5.c No, our answers to this evaluation do not apply to our subcontractors and on certain points they may be less advanced than us
Expl4.5 :

As in the reference frameworks for IS management (ISO 27001) or the GDPR in the European Union, it is important not to dilute responsibilities in uncontrolled subcontracting chains. This applies, for example, to consultants or freelancers who reinforce an internal team on a data science project. It is possible, for instance, to ask subcontractors to carry out the same evaluation on their own account and share their results with you.


Q4.6 : Distribution of the value creation
In the case of data science projects where several partners work alongside your organisation to develop a model, and that model is or will be the subject of an economic activity:

R4.6 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)
(Specific risk domain: roles and responsibilities in data science projects are divided up among multiple actors)

  • 4.6.a Our organisation carries out its data science activities autonomously, including the development of datasets and the exploitation of models for its own account. It is therefore not concerned | (Concerned / Not concerned)
  • 4.6.b At this stage we have not structured this aspect of multi-partner data science projects
  • 4.6.c In these cases, we contract the economic aspect of the relationship with the stakeholders involved upstream of the project
  • 4.6.d Our organisation has a policy that responsibly frames the sharing of value with the stakeholders involved
Expl4.6 :

When several partners work together to develop a model, it is important that the distribution of value resulting from an economic activity in which the model plays a role is made explicit and contractualized. In some cases this can be a complex issue, for example when a model is trained in a distributed manner over several datasets.

Resources4.6 :


Section 5 - Using models responsibly and in confidence

[Using the models]

An AI model can be used as an automatic system, whose rules or criteria are not written in extenso and are difficult to explain, discuss or adjust. Using automatic systems based on AI models whose rules have been "learned" (and not defined and formalised) therefore questions the way organisations design and operate their products and services. It is important to preserve the responsiveness and resilience of organisations using those AI models, particularly in dealing with situations where AI models have led to an undesirable outcome for the organisation or its stakeholders. In addition, efforts are needed on the interpretation and explanation of the choices made using these systems.

[⇧ back to the list of sections]
[⇩ next section]


Q5.1 : Exploitation of AI models for one's own account
If your organisation uses AI models on its own behalf:

R5.1 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)
(Specific risk domain: use of AI models, provision or operation of AI model-based applications for customers or third parties)

  • 5.1.a Our organisation does not use ML models on its own behalf | (Concerned / Not concerned)
  • 5.1.b A register of AI models identifies all the models used by the organisation and is kept up-to-date
  • 5.1.c For each model there is an owner defined, identifiable and easily contactable
  • 5.1.d For each model, we systematically carry out a risk assessment following any incidents, failures or biases
  • 5.1.e Monitoring tools are put in place to ensure continuous monitoring of systems based on AI models and can trigger alerts directly to the team in charge
  • 5.1.f For each model, we define and test a procedure for suspending the model and a degraded operating mode without the model, in order to prepare for the case where the model is subject to failure or unexpected behaviour
  • 5.1.g For each model, we study its entire lifecycle (all the steps and choices that led to its development and evaluation), as well as its conditions and limits of validity, in order to understand the model before using it
  • 5.1.h We always use the models for uses in accordance with their conditions and limits of validity
  • 5.1.i We have not yet put in place such measures
Expl5.1 :

Using automatic systems based on models whose rules have been "learned" (and not defined and formalised) questions the way organisations design and operate their products and services. It is important to assess the consequences and reactions in the event of an incident. Furthermore, it is important that the persons in charge are clearly identified so that no stakeholder is left helpless in the face of an unexpected or inappropriate consequence. Finally, it is important to consider the "conditions and limits of validity" of the models used to ensure that the intended use is appropriate.


Q5.2 : Development of AI models on behalf of third parties
If your organisation provides or operates AI model-based applications to customers or third parties:

R5.2 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)
(Specific risk domain: use of AI models, provision or operation of AI model-based applications for customers or third parties)

  • 5.2.a Our organisation neither provides its customers or third parties with applications based on ML models, nor operates such applications on behalf of third parties | (Concerned / Not concerned)
  • 5.2.b A register of AI models identifies all models or applications used by its customers and/or by the organisation on behalf of third parties, and is kept up-to-date
  • 5.2.c For each model or application for a customer or a third party we have a defined, identifiable and easily reachable owner
  • 5.2.d For each model or application for a customer or a third party, we systematically carry out a risk assessment resulting from possible incidents, failures, biases, etc., in order to identify the risks involved
  • 5.2.e Monitoring tools are in place to ensure continuous monitoring of ML systems and can trigger alerts directly to the responsible team
  • 5.2.f For each model or application for a customer or a third party, we define and test a procedure for suspending the model and a degraded operating mode without the model, in order to prepare for the case where the model is subject to failure or unexpected behaviour
  • 5.2.g For each model or application for a client or third party, we study its entire lifecycle and its conditions and limits of validity to understand the model before using it
  • 5.2.h We supply our customers or operate on their behalf with models or applications for uses in accordance with their conditions and limits of validity
  • 5.2.i We have not yet put in place such measures
Expl5.2 :

Using automatic systems based on models whose rules have been "learned" (and not defined and formalised) questions the way organisations design and operate their products and services. It is important to assess the consequences and reactions in the event of an incident. Furthermore, it is important that the persons in charge are clearly identified so that no stakeholder is left helpless in the face of an unexpected or inappropriate consequence. Finally, it is important to consider the "conditions and limits of validity" of the models used to ensure that the intended use is appropriate.


Q5.3 : Management of problematic predictions, bypass process, human agency
Automatic systems, especially when based on AI models, are generally used in production to gain efficiency. By nature, they occasionally generate undesirable results for the organisation and its stakeholders (e.g. a wrong prediction), as they will never achieve 100% performance.

R5.3 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)
(Specific risk domain: use of AI models, provision or operation of AI model-based applications for customers or third parties)

  • 5.3.a Our organisation does not use AI models on its own behalf or on behalf of its clients, and does not provide its clients with applications based on AI models | (Concerned / Not concerned)
  • 5.3.b We implement AI models in integrated automatic systems, without mechanisms to overcome or avoid undesirable results due to model predictions
  • 5.3.c We integrate, in automatic systems based on AI models, the functionalities to manage these cases of undesirable results. For such cases, we set up mechanisms allowing a human operator to go against an automatic decision to manage such undesirable results or incidents
  • 5.3.d In addition to incident management mechanisms, in automatic systems based on AI models, a human operator is called upon when the confidence interval for the automatic decision is not satisfactory
  • 5.3.e We systematically apply the principle of "human agency", the outputs of the AI models that we implement are used by human operators, and do not serve as determinants for automatic decisions
Expl5.3 :

Using automatic systems based on models whose rules have been "learned" (and not defined and formalised) questions the way organisations design and operate their products and services. It is important to preserve the responsiveness and resilience of the organisation.

Resources5.3 :

Q5.4 : Explainability and interpretability
Within data science projects aiming at developing AI models:

R5.4 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)

  • 5.4.a Our organisation is not yet familiar with the methods and tools for explaining and interpreting AI models
  • 5.4.b We are interested in the explainability and interpretability of AI models and are in dialogue with our stakeholders on this subject
  • 5.4.c We ensure that the models we develop provide, when relevant, at least a confidence level together with each prediction made
  • 5.4.d We determine the best trade-offs between performance and interpretability for each model we develop, which sometimes leads us to opt for a model that is simpler to explain to stakeholders
  • 5.4.e We master and implement advanced approaches for the explainability and interpretability of models
Expl5.4 :

Explainability and interpretability are key issues, in line with the growing demands for transparency, impartiality and accountability. In some cases, regulations even require it. Technical resources such as SHAP or LIME provide a first introduction to the topic (see resources associated with this assessment element).
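As a dependency-free illustration of the idea behind such tools, the sketch below estimates each feature's local contribution as the change in the model output when that feature is replaced by a baseline value (an occlusion-style attribution); the linear scoring function and feature names are made up for the example:

```python
def model(x):
    """Hypothetical scoring function standing in for a trained model."""
    return 0.5 * x["age"] + 0.3 * x["income"] - 0.2 * x["debt"]

def attribute(x, baseline):
    """Local contribution of each feature: output change when occluded."""
    contributions = {}
    for name in x:
        perturbed = dict(x)
        perturbed[name] = baseline[name]  # replace one feature by baseline
        contributions[name] = model(x) - model(perturbed)
    return contributions

x = {"age": 0.8, "income": 0.6, "debt": 0.4}          # normalised inputs
baseline = {"age": 0.0, "income": 0.0, "debt": 0.0}   # reference point
for name, c in attribute(x, baseline).items():
    print(f"{name}: {c:+.2f}")
```

Libraries such as SHAP generalise this intuition with sound theoretical foundations (Shapley values) and support for non-linear models.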

Resources5.4 :

Q5.5 : Transparency towards stakeholders interacting with an AI model
Your organisation uses for its own account, provides to its customers, or operates on behalf of its customers applications based on AI models with which users can interact. What measures does it implement to inform users?

R5.5 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)
(Specific risk domain: use of AI models, provision or operation of AI model-based applications for customers or third parties)

  • 5.5.a Our organisation does not use AI models on its own behalf or on behalf of its clients, and does not provide its clients with applications based on AI models | (Concerned / Not concerned)
  • 5.5.b Users are not informed that they are interacting with an AI model developed with machine learning methods
  • 5.5.c An information notice is made available in the terms and conditions of the system or an equivalent document, freely accessible
  • 5.5.d The system or service makes it explicit to the user that an AI model is being used
  • 5.5.e The system or service provides the user with additional information on the results it would have provided in slightly different scenarios (e.g. "counterfactual explanations" such as the smallest change in input data that would have resulted in a given different output)
  • 5.5.f We are pioneers in using public AI registers, enabling us to provide transparency to our stakeholders and to capture user feedback
Expl5.5 :

Using automatic systems based on models whose rules have been "learned" (and not defined and formalised) questions the functioning of organisations but also the relationship of users to digital systems and services. In most cases it is important to inform users that they are not interacting with conventional business rules.

Resources5.5 :

Q5.6 : Logging predictions from AI models
If your organisation provides or operates AI model-based applications to customers or third parties, to enable auditability of such applications and facilitate their continuous improvement, it is key to implement predictions logging. On that topic:

R5.6 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)
(Specific risk domain: use of AI models, provision or operation of AI model-based applications for customers or third parties)

  • 5.6.a Our organisation does not use AI models on its own behalf or on behalf of its clients, and does not provide its clients with applications based on AI models | (Concerned / Not concerned)
  • 5.6.b Logging predictions from AI models used in production is not yet systematically implemented
  • 5.6.c We systematically log all predictions from AI models used in production (coupled with the input data and the associated models references)
Expl5.6 :

Using automatic systems based on AI models whose rules have been learned questions the way organisations design and operate their products and services. It is important to preserve the responsiveness and resilience of organisations using those AI models, particularly in dealing with situations where AI models have led to an undesirable outcome for the organisation or its stakeholders. To that end, logging predictions from AI models used in production (coupled with the input data and the associated model references) is key to enable ex-post auditability on concrete use cases. It should be noted that predictions might involve personal data and be regulated by the GDPR. Anonymisation of processed data, when logged and made available to customers or internal operators, could be part of a solution to avoid leaking sensitive information.
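A minimal sketch of such prediction logging might look as follows; in a real system the records would go to an append-only store, and personal data in the inputs may first need to be anonymised. All names and values are illustrative:

```python
import datetime
import json

PREDICTION_LOG = []  # stands in for an append-only log store

def log_prediction(model_ref, inputs, output):
    """Record a prediction together with its inputs and model reference."""
    PREDICTION_LOG.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_ref": model_ref,  # e.g. model name + version + artefact hash
        "inputs": inputs,        # may require anonymisation (GDPR)
        "output": output,
    })

log_prediction("churn_classifier:1.3.0", {"age": 34, "income": 42000}, 0.71)
print(json.dumps(PREDICTION_LOG[-1], indent=2))
```

Coupling each record with the model reference is what makes it possible, ex post, to replay a disputed prediction against the exact model version that produced it.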



Section 6 - Anticipating, monitoring and minimising the negative externalities of data science activities

[Negative externalities]

The implementation of an automatic system based on an AI model can generate negative social and environmental externalities. Awareness of this is essential, as well as anticipating, monitoring and minimising the various negative impacts.

[⇧ back to the list of sections]


Q6.1 : Environmental impact (energy consumption and carbon footprint)
About the environmental impact of the data science activity in your organisation:

R6.1 :
(Type: multiple responses possible)
(Select all the answer items that correspond to practices in your organisation)

  • 6.1.a At this stage we have not studied specifically the environmental impact of our data science activity or our AI models
  • 6.1.b We have developed indicators that define what we want to measure regarding the energy consumption and the carbon footprint of our data science activity or our models
  • 6.1.c We measure our indicators regularly
  • 6.1.d We include their measurements in the model identity cards
  • 6.1.e Monitoring our indicators on a regular basis is a formalised and controlled process, from which we define and drive improvement objectives
  • 6.1.f We consolidate an aggregated view of the energy consumption and carbon footprint of our data science activities
  • 6.1.g This aggregated view is taken into account in the global environmental impact evaluation of our organization (e.g. carbon footprint, regulatory GHG evaluation, Paris Agreement compatibility score...)
  • 6.1.h The energy consumption and carbon footprint of our data science activity or our models is made transparent to our counterparts and the general public
Expl6.1 :

It is important to question and raise awareness of environmental costs. In particular, one can: (i) measure the environmental cost of data science projects, (ii) publish their environmental impact transparently, making explicit the split between training and production phases, (iii) improve on these indicators by working on different levers (e.g. infrastructure, model architecture, transfer learning, etc.). It has been demonstrated that such choices can impact the carbon footprint of model training by a factor of 100 to 1000 (see resources below).
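As an illustration of how such an indicator can be estimated, the sketch below derives a training run's carbon footprint from measured energy use and grid carbon intensity. All figures are assumptions for the example; tools such as CodeCarbon automate this kind of measurement:

```python
# illustrative figures for a hypothetical training run
gpu_power_kw = 0.3        # average power draw of the training hardware, kW
training_hours = 48       # wall-clock duration of the run
pue = 1.5                 # datacentre Power Usage Effectiveness (assumption)
carbon_intensity = 0.06   # kgCO2e per kWh (low-carbon grid, assumption)

energy_kwh = gpu_power_kw * training_hours * pue
footprint_kg = energy_kwh * carbon_intensity
print(f"{energy_kwh:.1f} kWh, {footprint_kg:.2f} kgCO2e")
```

The same computation applied to the production (inference) phase allows the split mentioned above to be made explicit, and the per-model figures can feed the aggregated view of answer 6.1.f.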

Resources6.1 :

Q6.2 : Social impact
In some cases, the implementation of an automatic system based on an AI model can generate negative externalities on upstream stakeholders (e.g. annotation of data) and on downstream stakeholders (e.g. automation of certain positions). Whenever you plan to develop or use an AI model:

R6.2 :
(Type: single answer)
(Select one answer only, which best corresponds to the level of maturity of the organisation on this topic)

  • 6.2.a At this stage we are not looking at the social impact of our data science activity or our AI models
  • 6.2.b In some cases we study the social impact
  • 6.2.c We study the social impact in each project
  • 6.2.d We study the social impact in each project and it is documented in the lifecycle documentation of each model
  • 6.2.e We study the social impact in each project, it is documented in the lifecycle documentation of each model, and we systematically engage in a dialogue with the relevant stakeholders upstream and downstream the value chain.
Expl6.2 :

It is important for an organisation to question and exchange with its stakeholders. This applies both downstream (e.g. automation of certain jobs) and upstream (e.g. data annotation tasks that can be psychologically very harsh) of the value chain.


Q6.3 : Ethics and non-maleficence
Within your organisation:

R6.3 :
(Type: multiple responses possible)
(Select all response items that correspond to practices in your organisation. Please note that some combinations would not be coherent)

  • 6.3.a At this stage we have not yet addressed the ethical dimension of our data science projects and activities
  • 6.3.b We are studying the ethical dimension of our data science projects and activities, it is a work in progress
  • 6.3.c Employees involved in data science activities receive training in ethics
  • 6.3.d Our organisation has adopted an ethics policy
  • 6.3.e For projects justifying it, we set up an independent ethics committee or ask for the evaluation of an organisation validating the ethics of the projects
Expl6.3 :

Working with large volumes of data, some of which may be sensitive, and using automatic systems based on models whose rules have been "learned" (and not defined and formalised) raises questions about the way organisations function and the individual responsibility of each contributor. It requires carefully considering the uses of such systems. It is therefore important for the organisation to ensure that ethical issues are not unknown to its collaborators. A recurring example is that some AI systems or services designed to adapt to user behaviour may influence users (e.g. by seeking to maximise their time of use or the money they spend) and present significant risks of manipulation or addiction.

Resources6.3 :