The proposed Indian Data Protection Act (PDPA 2020) refers to “Personal Data”, “Anonymization” and “De-identification/Pseudonymization”.
Anonymisation is defined as an “irreversible” process of transforming personally identifiable data into a form from which the identity of the data principal cannot be recovered. Anonymisation frees the data from PDPA controls.
On the other hand, “de-identification” means “the process by which a data fiduciary or data processor may remove, or mask identifiers from personal data, or replace them with such other fictitious name or code that is unique to an individual but does not, on its own, directly identify the data principal”.
The definition of de-identification includes “Pseudonymization” by way of replacing identifiers in the identifiable personal data set.
De-identification is a technical control used to mitigate risk within a data processing environment, so that the real identity attached to an identified data principal is not shared across the organization.
Naavi has been advocating the use of a “Pseudonymization Gateway” as a standard feature, so that an organization immediately pseudonymizes all identity parameters at the point of entry of personal data and creates a confidential mapping of the pseudonymized parameters to the real parameters, along with a unique data identity, to enable re-identification when required.
If this suggestion is technically implemented, the entire organization will carry out personal data processing on the de-identified/pseudonymized data, reducing the risk of data breach to near zero. The Pseudonymization Gateway would be managed by an “Internal Data Controller” and would maintain the mapping table as securely as possible, with appropriate encryption, split keys in the custody of multiple custodians, etc.
When the processed data is to be disclosed and is required to be re-identified, the designated “Internal Data Disposers” would be responsible for re-identifying the data, creating the “processed version of the data with real identity”, and disclosing it to the recipients as may be required.
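The sketch below is only a minimal illustration of how such a gateway and disposer could work; the class name, the field names and the plain in-memory mapping store are assumptions made for the example, not an actual design. A real implementation would protect the mapping table with encryption and split keys held by multiple custodians, as described above.

```python
# Illustrative sketch of a "Pseudonymization Gateway" (hypothetical names).
# Identity parameters are replaced with a random pseudonym at the point of
# entry; a confidential mapping table (here a plain dict, in practice an
# encrypted store) allows the Internal Data Disposer to re-identify the
# record when an authorised disclosure requires it.
import secrets

class PseudonymizationGateway:
    def __init__(self):
        self._mapping = {}  # pseudonym -> real identity parameters

    def pseudonymize(self, record, identity_fields=("name", "email", "phone")):
        """Strip identity parameters from an incoming record and return a
        de-identified copy carrying only a unique data identity (pseudonym)."""
        pseudonym = secrets.token_hex(8)                  # unique data identity
        real_identity = {f: record[f] for f in identity_fields if f in record}
        self._mapping[pseudonym] = real_identity          # confidential mapping
        deidentified = {k: v for k, v in record.items() if k not in identity_fields}
        deidentified["data_id"] = pseudonym
        return deidentified

    def reidentify(self, processed_record):
        """Used by the Internal Data Disposer to restore the real identity
        before disclosure to a recipient."""
        real_identity = self._mapping[processed_record["data_id"]]
        return {**processed_record, **real_identity}

# Example: all internal processing happens only on the de-identified record.
gateway = PseudonymizationGateway()
deid = gateway.pseudonymize({"name": "Asha", "email": "asha@example.com",
                             "test_result": "diabetic"})
disclosed = gateway.reidentify(deid)
```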
The controls for personal data breach mitigation are therefore confined to the Internal Data Controllers and Internal Data Disposers.
(P.S.: Here the word “internal” refers to the persons being employees of the organization, even though the disclosure of information is to outsiders. The term “dispose” refers both to external disclosures and to destruction of the identity of personal data or deletion of the personal data.)
PDPA has not used the term “Differential Privacy” which is a term developed by data scientists in the Big Data processing scenario.
The Sri Krishna Committee, while winding up its recommendations, commented that there is a separate need for developing regulation of “Community Data”, a form of aggregation of data that is relevant to “Differential Privacy”. This matter is now before the Kris Gopalakrishna committee on Data Governance.
As a concept, “Differential Privacy” addresses the need to process aggregated data in such a manner that the identity of a data subject becomes irrelevant in the aggregation and disclosure. In other words, while the aggregation happens with identifiable data, the aggregation and processing are managed in such a manner that the disclosed data does not affect the privacy of any individual whose personal data is a component of the processed data.
One of the definitions of “Differential Privacy” is:
“Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.”
For example, A, B and C undergo a medical test; A and B are diagnosed as diabetic and C as healthy. (In an actual situation the numbers would be large, and A, B and C may represent groups of a large number of persons.) When we say 33% of the persons are healthy and 67% are diabetic, we are disclosing information derived from the personal data of A, B and C. However, as long as the disclosed data remains at the level of these percentages, the identity of the individuals remains masked.
When the data of another subject is added to or removed from the data set, then (assuming large numbers) the pattern of the disclosed data does not reveal the identity of the person whose data was added or removed. Since the query result on the processed data cannot be used to infer whether that person was diabetic or healthy, it is considered that “privacy is preserved”.
The development of processing that meets this criterion is referred to as “Differential Privacy” by data scientists.
More technically,
“A processing algorithm is considered differentially private if an observer seeing its output cannot tell if a particular individual’s information was used in the computation.”
This concept is used by statistical organizations processing personal information.
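As a rough illustration of the idea (not a mechanism prescribed by PDPA or any committee), the sketch below uses the Laplace mechanism, one common way data scientists achieve differential privacy for counting queries. The dataset, field names and the epsilon value are assumptions made for the example.

```python
# Minimal sketch of the Laplace mechanism for a counting query.
# A count has sensitivity 1: adding or removing one person's record changes
# the true count by at most 1, so Laplace noise of scale 1/epsilon masks
# whether any particular individual's data was used in the computation.
import random

def dp_count(records, predicate, epsilon=0.5):
    true_count = sum(1 for r in records if predicate(r))
    # Difference of two exponentials with rate epsilon ~ Laplace(0, 1/epsilon)
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

dataset = [{"id": i, "diagnosis": "diabetic" if i % 3 else "healthy"}
           for i in range(1000)]
neighbour = dataset[:-1]   # the same dataset with one individual removed

# The two noisy answers are statistically almost indistinguishable, so an
# observer cannot tell whether the removed person's data was in the dataset.
print(dp_count(dataset, lambda r: r["diagnosis"] == "diabetic"))
print(dp_count(neighbour, lambda r: r["diagnosis"] == "diabetic"))
```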
The Indian PDPA does not refer to Differential Privacy (nor do other laws such as GDPR) because these data protection laws consider that statistical processing of the type referred to above can be done with “de-identified” or “pseudonymized” data. Hence the question of identifying an individual whose data set moves in or out of a collection of data does not matter for the privacy of the individual.
A Big Data processor who is today looking at Differential Privacy can instead introduce an automated data anonymization process, so that all incoming identified data sets become anonymised data sets at the gateway and remain visible only at the machine level. When the data is filtered into the internal systems visible to human beings, it is already in an “anonymized state”, and hence the “Differential Privacy” concept may not be required.
This suggestion was made by the undersigned to one company processing CCTV footage and can be a substitute for differential privacy.
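In contrast to the pseudonymization sketch earlier, an anonymization gateway keeps no mapping table at all, so re-identification is not possible and the output falls outside PDPA controls. The sketch below is only an assumed illustration with hypothetical field names; a real CCTV or analytics pipeline would need domain-specific measures such as face blurring at the machine level.

```python
# Illustrative sketch of irreversible anonymization at the gateway:
# direct identifiers are dropped and quasi-identifiers are generalised,
# and no mapping is retained, so the transformation cannot be reversed.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "face_image"}

def anonymize_at_gateway(record):
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in out:                                   # generalise quasi-identifiers
        out["age_band"] = f"{(out.pop('age') // 10) * 10}s"
    if "pincode" in out:
        out["pincode"] = str(out["pincode"])[:3] + "XXX"
    return out

print(anonymize_at_gateway({"name": "Ravi", "age": 47,
                            "pincode": 560001, "footfall_zone": "Gate 2"}))
# -> {'pincode': '560XXX', 'footfall_zone': 'Gate 2', 'age_band': '40s'}
```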
If there is any specific processing requirement where the input has to be on an identified basis and disclosure is required to be made, then the use of “Differential Privacy as an algorithmic feature” becomes the responsibility of the processor under “Legitimate Interest”.
The Kris Gopalakrishna committee on Data Governance may need to debate “Differential Privacy” in greater detail.
If the Government pursues the concept of “Open Data” and wants to collect, process and disclose identifiable personal data in an aggregated form for the benefit of the society, the concept of Differential Privacy may be useful.
Similarly, data research organizations harvesting personal data from public sources and profiling the behaviour of communities also need to adopt the principles of differential privacy in their processing and present a legitimate interest claim when they submit a DPIA to the data protection authorities.
(This topic requires further discussion. I have tried to seed some thoughts for discussion; comments and inputs are invited. …Naavi)
Naavi