Early work suggested that anonymization destroys the utility of data for machine learning tasks [
45]. However, many methods for optimizing anonymized data as a training set for prediction models have since been developed, showing that this is not actually the case. Initially, these methods focused on simple anonymization techniques, such as
k-anonymity, simple prediction models, such as decision trees, and applications in distributed settings [
35,
46]. As a result of these developments, evaluating (novel) anonymization methods by measuring the usefulness of output data for predictive modeling tasks has become a standard practice in academia [
47,
48]. More recently, a broader spectrum of prediction and privacy models has been investigated. Some authors proposed general-purpose anonymization algorithms to optimize prediction performance. While most of these algorithms have been designed in such a way that the resulting anonymized data is guaranteed to provide a degree of protection based on specific privacy models only [
49,
50], they allow for any type of prediction model to be used. In contrast, in other works, privacy-preserving algorithms for optimizing the performance of specific prediction models were developed [
51,
52]. Many recent studies focused on sophisticated models, such as support vector machines [
51,
53,
54] and (deep) neural networks [
55–
57]. More complex and comprehensive privacy models have also received significant attention. In particular, the differential privacy model was investigated extensively [
53,
55,
56,
58–
62]. It is notable that, among these more modern approaches, several have focused on biomedical data [
56,
57,
60]. We note, however, that these developments originate from the computer science research community and that, if the developed algorithms are published at all, it is typically only in the form of research prototypes.
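To make the most basic privacy model mentioned above concrete: a data set is k-anonymous with respect to a set of quasi-identifiers if every combination of their values occurs at least k times, so that each record is indistinguishable from at least k-1 others. The following minimal sketch illustrates this definition; the toy records and the function name are illustrative and not taken from any of the cited methods or tools:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears at least k times in the data set (k-anonymity)."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Toy data: age generalized to ranges, ZIP codes truncated.
records = [
    {"age": "30-39", "zip": "123**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "123**", "diagnosis": "cold"},
    {"age": "40-49", "zip": "456**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "456**", "diagnosis": "asthma"},
]

print(is_k_anonymous(records, ["age", "zip"], 2))  # True
print(is_k_anonymous(records, ["age", "zip"], 3))  # False
```

The sensitive attribute (here, the diagnosis) is deliberately excluded from the quasi-identifiers: k-anonymity only constrains the attributes an attacker could plausibly link to external data.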
In parallel, several practical tools have been developed that make methods of data anonymization available to end-users by providing easy-to-use graphical interfaces. Most notably,
μ-
ARGUS [
63] and
sdcMicro [
64] are tools developed in the context of official statistics, while ARX has specifically been designed for applications to biomedical data [
19].
μ-ARGUS and sdcMicro focus on the concept of
a posteriori disclosure risk control, which is prevalent in the statistics community. In this process, data is mainly transformed manually in iterative steps, while data utility and disclosure risks are monitored continuously by performing statistical analyses and tests. ARX implements a mixture of this approach and the
a priori disclosure risk control methodology. This means that data is anonymized semi-automatically: in each iteration, the data is sanitized in such a way that predefined thresholds on privacy risks are met while the impact on data utility is minimized. A balance between privacy and utility is struck by repeating this process with different settings, thereby iteratively refining the output data. This approach has been recommended for anonymizing health data (see, e.g. [
7,
12] and [
13]) and it enables ARX to support an unprecedentedly broad spectrum of techniques for transforming data and measuring risks. All three tools provide users with methods for assessing and optimizing the usefulness of anonymized data for a wide variety of applications. ARX is, however, the only tool providing support for privacy-preserving machine learning.
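The a priori approach can be sketched in a few lines: a risk threshold is fixed in advance, and progressively stronger transformations are applied until the residual risk falls below it. The risk estimate (the reciprocal of the smallest quasi-identifier group size) and the single-attribute generalization scheme below are simplified illustrations of the general idea, not ARX's actual algorithms:

```python
from collections import Counter

def max_reidentification_risk(records, quasi_identifiers):
    """Worst-case risk estimate: 1 / size of the smallest group of
    records sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1.0 / min(counts.values())

def generalize_zip(records, level):
    """Suppress the last `level` digits of the ZIP code."""
    return [{**r, "zip": r["zip"][: len(r["zip"]) - level] + "*" * level}
            for r in records]

def anonymize(records, threshold, max_level=5):
    """A priori control: apply the weakest generalization whose
    residual risk does not exceed the predefined threshold."""
    for level in range(max_level + 1):
        candidate = generalize_zip(records, level)
        if max_reidentification_risk(candidate, ["zip"]) <= threshold:
            return candidate, level
    return None, None  # threshold not attainable with this scheme

records = [{"zip": "12345"}, {"zip": "12346"},
           {"zip": "54321"}, {"zip": "54322"}]
data, level = anonymize(records, threshold=0.5)
# The weakest sufficient transformation suppresses one digit,
# leaving two groups of two records each (risk = 0.5).
```

In practice, a tool would search over many candidate transformations and pick the one minimizing utility loss among those meeting the threshold; this sketch only shows the threshold-driven direction of the search.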