差分隐私算法基础

本书为差分隐私经典理论书籍《The Algorithmic Foundations of Differential Privacy》的中文译本。

Cynthia Dwork, Aaron Roth（著）
侯建鹏（译）

1. 差分隐私的承诺

“差分隐私” 描述了数据持有者或管理者对数据主体的承诺：“无论是否存在其他研究、数据集或者信息源，您都不会因数据被用于任何研究或分析而受到不利影响”。理想情况下，满足差分隐私的数据库机制允许我们对广泛的机密数据，在无需借助于数据净室 (data clean rooms)、数据使用协议、数据保护计划或受限视图的情况下，进行精准分析。尽管如此，数据的可用性最终还是会受到影响：信息恢复的基本规律(Foundamental Law of Information Recovery) 指出，对太多问题作出过于精准的回答会以惊人的方式摧毁隐私¹。差分隐私算法研究的目标是竭尽所能地延缓这一必然结果。

差分隐私解决了这样一个“悖论”：在获取有效群体信息的同时不暴露个体信息。一个医学数据库可能会告诉我们吸烟致癌，从而影响保险公司对某个吸烟者长期医疗费用的评估。那么这一分析是否损害到了该吸烟者的利益？也许是的——如果保险公司知道他抽烟，他的保费可能会因此上涨。但他也可能因此获益——在了解了自身存在的健康风险后，他启动了戒烟计划。那吸烟者的隐私是否受到了损害？可以肯定的是，经过分析后对他的了解比分析前更多，但他的信息被 “泄露” 了吗？差分隐私认为事实并非如此，根本原因在于，无论该吸烟者是否参与了此项调研，此分析对他的影响都是相同的。影响吸烟者的是研究中得出的结论，而与他/她是否出现在该研究数据集中无关。

差分隐私确保了无论任何个体加入或退出数据集都会得出相同的结论，例如，吸烟致癌。具体来讲，它确保的是任意输出序列 (对查询的响应) 出现的可能性“基本上”独立于任何个体的存在与否。这里，结果的概率被隐私机制 (由数据管理员控制) 采取的随机选择控制，而术语 “基本上” 被参数 ϵ 约束。较小的 ϵ 会产生更好的隐私保护效果 (和不太准确的响应)。

差分隐私是一种定义，而不是一种算法。对一个给定的计算任务 T 和一个定值 ϵ，有许多差分隐私算法可以通过满足 ϵ-差分隐私的方式实现该计算任务 T。其中一些算法比其他算法精度更高。当 ϵ 很小时，为任务 T 找到一个高精度的 ϵ-差分隐私算法很难，就好比为特定的计算任务找到一个数值稳定的算法一样困难。

这个结果在章节8.1中被证明，它不仅适用于差分隐私，也适用于所有隐私保护相关的数据分析技术。

1.1 面向隐私保护的数据分析

差分隐私是针对“如何在进行数据分析的同时保护隐私”这一问题提出的隐私定义。我们简明扼要地解决了其他方法在处理该问题时存在的一些不足。

数据无法完全匿名并保持可用性。一般来讲，数据越丰富，它的可用性就越高且越值得挖掘。这引出了“匿名化”和“删除个人身份信息”的概念，其目的是隐藏部分数据记录，公开其余部分并用于分析。然而，数据的丰富性使得有时候可以通过一组出人意料的字段或者属性来 “识别” 一个人，例如邮政编码、出生日期和性别的组合，甚至是三部电影的名称以及相应的大致观看日期。这种 “识别” 能力可被用于链接攻击 (linkage attack)，来将“匿名” 记录与不同数据集中的非匿名记录进行匹配。例如，有研究者通过把匿名医疗数据与公开可用的选民登记记录进行匹配，从而破解了美国马萨诸塞州州长的医疗记录。此外，网飞公司 (Netflix) 因将用户的历史观看数据匿名公开在一个电影数据集中用作推荐竞赛的训练数据，而被研究者通过链接互联网电影数据库 (IMDb) 的方式识别出该公司的订阅用户。

“匿名”记录的再识别 (re-identification) 并非唯一的风险。我们显然不希望看到“匿名”的数据记录被重新识别，不仅是因为再识别本身会揭示数据集中成员的身份，同时也因为记录中可能包含敏感信息，如果将其与个人相关联，有可能对个人造成损害。从特定紧急护理中心收集的指定日期的医疗记录可能只包含有限的病症或诊断记录。一条附近居民在这段时间访问该护理中心的额外信息，足以将该居民的病情状况缩小到相当狭窄的范围。虽然无法将特定记录与该居民匹配，但这对他的隐私保护作用微乎其微。

对大型集合的查询无法充分保护隐私。针对特定个体的查询无法提供安全且准确的回答，实际上人们可能希望直接屏蔽此类查询（如果能够识别它们）。如下面的差分攻击 (differencing attack) 所示，强制在大型集合上查询也并非灵丹妙药。假设已知X先生在某医疗数据库中，将两个针对大型数据集的查询“数据库中有多少人具有镰状细胞性状？”和“数据库中，除了X先生以外，有多少人具有镰状细胞性状？”的答案综合起来，就可以得出X先生的镰状细胞性状情况。

查询审计 (query auditing) 存在问题。人们可能会试图审计查询序列和响应序列，目的是在考虑历史记录的情况下，若回答当前查询会危及隐私，则禁止提供任何响应。例如，审计员会查找可能构成差分攻击的查询对。然而，这种方法存在两个困难。首先，拒绝回答查询的行为本身就可能泄漏信息。其次，查询审计在计算上也许是行不通的；实际上，如果查询语言足够丰富，甚至可能不存在用于判断一对查询是否构成差分攻击的算法程序。

汇总统计 (summary statistics) 并不“安全”。从某种意义上讲，汇总统计是一种失败的隐私保护解决方案，因为它无法抵御上述差分攻击。汇总统计还在存在其它问题，包括无法抵御针对数据库的各种重构攻击 (reconstruction attacks)，其中，数据库中的每个人有一个需要保护的“机密位 (secret bit)”。从实用性的角度来看，我们希望允许诸如“满足属性 \(P\) 的人中有多少人的机密位的值为 \(1\)？”的查询。而另一方面，攻击者的目标则是显著提高猜中个人机密位的机会。第8.1节描述的重构攻击表明，即使是线型数量的此类查询也难以防御：除非引入足够的误差，否则几乎所有的机密位都可以被重构出来。

发布汇总的统计数据存在风险，一个明显的例证是通过统计技术的应用可以判断一个人是否参与了某项全基因组关联研究，该技术最初的用途是在法医鉴定中确认某个人的 DNA 是否存在于生物混合物样本中。根据人类基因组计划的网站，“单核苷酸多态性，或 SNPs (发音为“snips”)，指的是当基因组序列中的一个单核苷酸 (A、T、C 或 G) 发生改变时产生的 DNA 序列变种。例如，一个 SNP 可能会将 DNA 序列 AAGGCTAA 改变为 ATGGCTAA。” 在这种情况下，我们称有两个等位基因: A 和 T。对于这样的 SNP，我们可以抛出一个问题：在特定的参考群体中，两个可能的等位基因的频率分别是多少？在已知参考群体中 SNPs 的等位基因频率的情况下，我们可以研究患有特定疾病的亚群（即“病例”组）中这些频率有何不同，从而寻找与该疾病相关的等位基因。出于这个原因，全基因组关联研究中可能包含病例组的大量 SNPs 的等位基因频率。根据定义，这些等位基因频率只是汇总后的统计数据，并且研究人员的假设是（当然，该假设是错误的），通过汇总的方式，隐私可以得到保护。然而，从理论上讲，如果拥有个人的基因组数据，就有可能确定该个体是否属于病例组（也就是，是否患有对应的疾病）。为了应对这个问题，美国国立卫生研究院和惠康基金会不再允许公众访问他们资助的全基因组关联研究的汇总频率数据。

即使对于差分隐私而言，这也是一个颇具挑战的难题，因为涉及的基因测量数量巨大（数十万甚至上百万），而任一病例组中的个体数量相对较少。

“寻常”的事实并不“安全”。如果数据主体被长期追踪，“寻常”的事实，例如购买面包，也可能会泄露隐私。例如，参考 T 先生，他年复一年地定期购买面包，直到突然变得很少买面包。分析师也许会得出结论，T 先生很有可能被诊断患有2型糖尿病。分析师的结论可能是正确的，也可能是错误的；无论哪种情况，T 先生的隐私都会受到损害。

牺牲“少数人”。在某些情况下，特定技术实际上可以为数据集中的“典型”成员，或更一般地讲，为“大多数”成员提供隐私保护。在这种情况下，人们经常听到这样的论点，即该技术足够好了，因为它只会泄露“少数几个”参与者的隐私。如果我们抛开对极端情况（被放弃的人的隐私刚好最重要）的担忧，牺牲“少数人”的理念并不完全没有价值：只需在权衡成本和收益的基础上，让社会各界达成共识。与牺牲“少数人”的理念相匹配的隐私定义尚未被提出；不过，对于单个数据集，可以通过随机选择一些数据行组成子集并公开它们来模拟“少数人”的隐私被泄漏 (第4章，引理 4.3)。随机子集中用来描述统计分析质量的采样边界 (sampling bounds) 决定了要公开的行数。当牺牲“少数人”的理念不被接受时，差分隐私成为了另外一种选择。

1.2 文献说明

Sweeney [81] 将选民登记记录与“匿名”医疗记录数据联系起来；Narayanan 和 Shmatikov 对网飞公司发布的匿名排名数据进行了链接攻击 [65]。生物混合物相关的研究是由 Homer 等人完成的 [46]。重构攻击是由 Dinur 和 Nissim [18] 首先提出的。

2. 基本术语

本章将介绍差分隐私的正式定义，阐述其意义，并列举它的一些关键特性。

2.1 计算模型

我们假设存在一个可信赖的管理者 (curator)，他持有数据库 \(D\) 中的个人数据。该数据库通常由 \(n\) 行数据组成，且每一行对应一个用户。通俗地讲，隐私的目标是在保护每一行数据的同时允许对整个数据库进行统计分析。

在非交互式或离线模型中，管理者会一次性生成某种对象，比如，“合成数据库 (synthetic database)”，汇总的统计数据集合，或者“无害化数据库 (sanitized database)”。在此次发布后，管理者不再扮演任何角色，原始数据可能会被销毁。

一个查询指的是一个应用于数据库的函数。交互式或在线模型允许数据分析人员自适应地提出查询，即根据先前查询的响应决定接下来的查询。

可信的管理者可以被用户执行的协议替代，它采用安全多方协议 (secure multiparty protocols) 的加密技术，但大多数情况下我们不会诉诸于加密假设。第12章介绍了这种模型，还探讨了有关文献中的其他相关模型。

在事先知道所有查询时，非交互式模型的准确性应该是最好的，因为它能够在了解查询结构的情况下关联噪音。相反，当事先不知道有关查询的任何信息时，非交互式查询面临着严峻的挑战，因为它必须为所有可能的查询提供答案。正如我们将看到的，为了保护隐私，甚至为了防止隐私灾难，准确性必然会随着提出的查询的数量而降低，提供所有可能查询的准确答案将是不可行的。

一种隐私机制（简称一种机制）是一个算法，它以一个数据库、数据类型的全集 \(\mathcal{X}\) (所有可能的数据库行的集合)、随机位以及（可选的）一组查询作为输入，并产生一个字符串作为输出。如果输入中存在查询，那么希望可以从输出的字符串中解码出相对准确的查询答案。如果输入中不存在查询，则属于非交互式类型，目标是得到可解释的输出字符串，为后续查询提供答案。

在某些情况下，我们可能要求输出字符串是一个合成数据库。这是一个从所有可能的数据行的全集 \(\mathcal{X}\) 中抽取的多重集 (multiset)。这种情况下的解码方法是在合成数据库上执行查询，然后进行某种简单的转换，例如乘以一个比例因子，来获得真实查询结果的近似值。

2.2 迈向隐私数据分析的定义

在数据分析的背景下，一种自然的定义隐私的方法是要求分析人员在完成数据分析后，对数据集中任何人的了解与之前相比没有增加。通过要求攻击者对个人的先验观点和后验观点（即访问数据库前后的看法）不应“差异太大”，或者通过要求对数据库的访问不应“过多”地改变攻击者对个人的看法，也可以很自然地形式化这个目标。然而，只要我们可以通过数据库学习到任何内容，这种隐私概念就是不切实际的。打个比方，假设攻击者有一个（错误的）先验观点是每个人都有两只左脚。通过访问统计数据库后发现，几乎每个人都有一只左脚和一只右脚。现在，攻击者对于任何特定的受访者是否有两只左脚的观点已经被彻底改变了。

通过前后对比或者约束“无信息泄漏”来定义隐私的吸引力在于它符合我们的直觉：如果没有泄漏个人的任何信息，那么这个人就不会因为数据分析而受到伤害。然而，“吸烟致癌”的例子表明这种直觉是错误的，其原因在于辅助信息（\(X\)先生吸烟）。

以上定义隐私的“无信息泄漏”方法与密码系统中的语义安全 (semantic security) 概念有着相似之处。粗略来讲，语义安全是指无法从密文中了解关于明文（未加密消息）的任何信息。也就是说，在阅读密文后所了解的任何有关明文的信息，都是在阅读密文之前就已知的。因此，即使存在辅助信息表明被加密的内容是“狗”或“猫”，密文无法让你更进一步分辨哪个才是被加密的内容。从形式上来说，我们可以通过比较窃听者猜测加密对象（是“狗”还是“猫”）的能力，以及所谓的对抗模拟器 (adversary simulator) 猜测相同内容的能力来对此进行建模。其中，对抗模拟器知晓辅助信息，但是无法访问密文。如果在窃听者和模拟器都知晓所有辅助信息的情况下，针对不同的窃听者，对抗模拟器猜测出正确答案的可能性都与之基本相同，那么该系统就具备语义安全的性质。当然，要使系统真正可用，合法接收者必须能够正确解密消息，否则任何无法解密消息的系统都能轻易满足语义安全的性质，但毫无实用价值。

我们知道，在标准计算假设下，存在语义安全的加密系统，那么为什么我们不能构建语义安全的隐私数据库机制，在保证各行数据隐私的同时提供查询结果呢？

首先，这个类比并不完美：在语义安全的加密系统中，存在三个参与者：消息发送着（加密明文消息）、消息接收者（解密密文）和窃听者（因无法了解任何发送前未知的明文消息而感到沮丧）。相比之下，在隐私数据分析的场景下，只有两方参与：执行隐私机制的管理者（类似于消息发送者）和数据分析人员，后者接收针对查询的响应信息（类似于消息接收者），同时也在尝试获取个人的敏感信息（类似于窃听者）。由于数据分析人员既是合法接收者，也是试图窥探隐私的攻击者，因此将其与加密进行类比存在缺陷：拒绝向攻击者提供任何信息意味着拒绝向数据分析人员提供任何信息。

其次，与加密方案一样，我们需要确保隐私方案的可用性，这意味着它能够教给分析人员一些她之前不知道的信息。这种信息对于对抗模拟器是不可获取的，也就是说，模拟器无法“预测”分析人员学到了什么。因此，我们可以将数据库视为一种（不可预测的）弱随机位来源，从中提取高质量的随机性用作随机填充 (random pad)。它可被用于一项加密技术：将机密信息添加到一个随机值（“随机填充”）中，以生成一个从信息理论角度隐藏机密信息的字符串。只有知道随机填充的人才能解密信息；任何对填充一无所知的人，无论其计算能力如何强大，都无法获知机密信息。由于数据分析人员可以访问数据库，所以他们可以学到随机填充，而没有数据库访问权限的对抗模拟器则无法获得有关填充的任何信息。因此，假设辅助信息是使用随机填充加密的机密信息，那么分析人员可以对其解密，但对抗模拟器则无法获知任何机密信息。这导致了攻击者/分析人员与对抗模拟器在获取机密信息的能力方面存在巨大差距，彻底破灭了和语义安全进行类比的可能性。

无论是在吸烟致癌的例子还是语义安全的类比中，阻碍我们实现目标的都是辅助信息。显然，为了使隐私保护具有意义，既是在存在“合理”辅助信息的情况下，它也应该依然有效。然而，如何区分合理的辅助信息本身也是一个难题。例如，使用政府数据库的分析人员可能同时也是一家大型搜索引擎公司的员工。对于此类人员可能掌握的辅助信息，我们应该如何做出“合理”的假设呢？

2.3 形式化差分隐私

我们将从差分隐私的专业定义开始，然后对其进行解读。差分隐私通过特定的过程 (process) 来提供隐私保护，特别是会引入随机性。通过随机过程保护隐私的一个早期例子是随机响应 (randomized response)，这是社会科学中为收集不当或不法行为的统计信息而开发的一项技术，不当或不法行为通过是否具备属性 \(P\) 来判定。参与研究的受访者被要求根据以下步骤报告自己是否具备属性 \(P\)：

抛掷一枚硬币。
如果是反面，则如实回答。
如果是正面，则抛掷第二枚硬币，如果是正面则回答“是”，如果是反面则回答“否”。

该技术对“隐私”的保护来自于受访者存在合理否认任何结果的可能性；特别是，如果具备属性 \(P\) 意味着参与了非法行为，即使回答“是”也无法证明受访者有罪，因为无论受访者是否具备属性 \(P\)，他们回答“是”的概率至少有 \(1/4\)。准确性 (accuracy) 来自于对噪音生成过程（随机引入伪造的答案“是”和“否”）的理解：回答 “是” 的人数的期望等于不具备属性 \(P\) 的人数的 \(1/4\) 加上具备属性 \(P\) 的人数的 \(3/4\)。因此，如果 \(p\) 表示具备属性 \(P\) 的受访者的真实比例，则回答 “是” 的预期人数为 \((1/4)(1 − p) + (3/4)p = (1/4) + p/2\)。这样，我们可以估计 \(p\) 的值为回答 “是” 的人数的两倍再减去 \(1/2\)，即 \(2((1/4) + p/2) − 1/2\)。

随机化是至关重要的一环；更准确地说，无论现在或者将来存在什么样的辅助信息（包括其他数据库、研究、网站、线上社区、流言、报纸以及政府统计等等），任何有意义的隐私保护都必须要通过随机化才能实现。这可以通过一个简单的混合论证来证明，现在我们简述如下。为了进行反证，现假设存在一个有意义的确定性算法。“有意义”意味着对于某个查询，两个不同的数据库会产生有差异的结果。通过改变某个数据库中的一行数据，可以构造出一对数据库，它们只在特定的一行数据上有不同的取值，但在相同的查询下会产生不同的结果。如果攻击者知道这两个是几乎一样的数据库，那么他就能通过观察查询结果来推断出未知行的数据值。

因此，我们需要讨论随机算法的输入和输出空间。在本书中，我们使用的是离散概率空间。有时我们会将算法描述为从连续分布中采样，但这些情况应始终以适当的方式离散化为有限精度（参阅下文中的注解2.1）。一般而言，一个定义域为 \(A\) 和（离散）值域为 \(B\) 的随机算法，将与从 \(A\) 到 \(B\) 上的概率单纯形 (probability simplex) \(\Delta(B)\) 相关联：

定义 2.1 (概率单纯形). 对于给定的离散集合 \(B\)，其上的概率单纯形 \(\Delta(B)\) 定义如下：

\[ \Delta(B) = \{ x \in \mathbb{R}^{| B |}: x_i \ge 0 \text{ for all } i \text{ and } \sum_{i=1}^{| B |}{x_i} = 1 \} \]

参考文献

S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
M.-F. Balcan, A. Blum, J. D. Hartline, and Y. Mansour. Mechanism design via machine learning. In Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on, pages 605–614. IEEE, 2005.
A. Beimel, S. P. Kasiviswanathan, and K. Nissim. Bounds on the sample complexity for private learning and private data release. In Theory of Cryptography, pages 437–454. Springer, 2010.
A. Beimel, K. Nissim, and U. Stemmer. Characterizing the sample complexity of private learners. In Proceedings of the Conference on Innovations in Theoretical Computer Science, pages 97–110. Association for Computing Machinery, 2013.
A. Bhaskara, D. Dadush, R. Krishnaswamy, and K. Talwar. Unconditional differentially private mechanisms for linear queries. In H. J. Karloff and T. Pitassi, editors, Proceedings of the Symposium on Theory of Computing Conference, Symposium on Theory of Computing, New York, NY, USA, May 19–22, 2012, pages 1269–1284. 2012.
A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In Chen Li, editor, Principles of Database Systems, pages 128–138. ACM, 2005.
A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the sulq framework. In Principles of Database Systems. 2005.
A. Blum, K. Ligett, and A. Roth. A learning theory approach to noninteractive database privacy. In Cynthia Dwork, editor, Symposium on Theory of Computing, pages 609–618. Association for Computing Machinery, 2008.
A. Blum and Y. Monsour. Learning, regret minimization, and equilibria, 2007.
J. L. Casti. Five Golden Rules: Great Theories of 20th-Century Mathematics and Why They Matter. Wiley, 1996.
T. H. Hubert Chan, E. Shi, and D. Song. Private and continual release of statistics. In Automata, Languages and Programming, pages 405–417. Springer, 2010.
K. Chaudhuri and D. Hsu. Sample complexity bounds for differentially private learning. In Proceedings of the Annual Conference on Learning Theory (COLT 2011). 2011.
K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of machine learning research: JMLR, 12:1069, 2011.
K. Chaudhuri, A. Sarwate, and K. Sinha. Near-optimal differentially private principal components. In Advances in Neural Information Processing Systems 25, pages 998–1006. 2012.
Y. Chen, S. Chong, I. A. Kash, T. Moran, and S. P. Vadhan. Truthful mechanisms for agents that value privacy. Association for Computing Machinery Conference on Electronic Commerce, 2013.
P. Dandekar, N. Fawaz, and S. Ioannidis. Privacy auctions for recommender systems. In Internet and Network Economics, pages 309–322. Springer, 2012.
A. De. Lower bounds in differential privacy. In Theory of Cryptography Conference, pages 321–338. 2012.
I. Dinur and K. Nissim. Revealing information while preserving privacy. In Proceedings of the Association for Computing Machinery SIGACTSIGMOD-SIGART Symposium on Principles of Database Systems, pages 202–210. 2003.
J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. arXiv preprint arXiv:1302.3203, 2013.
C. Dwork. Differential privacy. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP)(2), pages 1–12. 2006.
C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503. 2006.
C. Dwork and J. Lei. Differential privacy and robust statistics. In Proceedings of the 2009 International Association for Computing Machinery Symposium on Theory of Computing (STOC). 2009.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference ’06, pages 265–284. 2006.
C. Dwork, F. McSherry, and K. Talwar. The price of privacy and the limits of lp decoding. In Proceedings of the Association for Computing Machinery Symposium on Theory of Computing, pages 85–94. 2007.
C. Dwork and M. Naor. On the difficulties of disclosure prevention in statistical databases or the case for differential privacy. Journal of Privacy and Confidentiality, 2010.
C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In Proceedings of the Association for Computing Machinery Symposium on Theory of Computing, pages 715–724. Association for Computing Machinery, 2010.
C. Dwork, M. Naor, T. Pitassi, G. N. Rothblum, and Sergey Yekhanin. Pan-private streaming algorithms. In Proceedings of International Conference on Super Computing. 2010.
C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. P. Vadhan. On the complexity of differentially private data release: Efficient algorithms and hardness results. In Symposium on Theory of Computing ’09, pages 381–390. 2009.
C. Dwork, M. Naor, and S. Vadhan. The privacy of the analyst and the power of the state. In Foundations of Computer Science. 2012.
C. Dwork, A. Nikolov, and K. Talwar. Efficient algorithms for privately releasing marginals via convex relaxations. In Proceedings of the Annual Symposium on Computational Geometry (SoCG). 2014.
C. Dwork and K. Nissim. Privacy-preserving datamining on vertically partitioned databases. In Proceedings of Cryptology 2004, vol. 3152, pages 528–544. 2004.
C. Dwork, G. N. Rothblum, and S. P. Vadhan. Boosting and differential privacy. In Foundations of Computer Science, pages 51–60. 2010.
C. Dwork, K. Talwar, A. Thakurta, and L. Zhang. Analyze gauss: Optimal bounds for privacy-preserving pca. In Symposium on Theory of Computing. 2014.
L. Fleischer and Y.-H. Lyu. Approximately optimal auctions for selling privacy when costs are correlated with data. In Association for Computing Machinery Conference on Electronic Commerce, pages 568–585. 2012.
A. Ghosh and K. Ligett. Privacy and coordination: Computing on databases with endogenous participation. In Proceedings of the fourteenth ACM conference on Electronic commerce (EC), pages 543–560, 2013.
A. Ghosh and A. Roth. Selling privacy at auction. In Association for Computing Machinery Conference on Electronic Commerce, pages 199– 208. 2011.
A. Groce, J. Katz, and A. Yerukhimovich. Limits of computational differential privacy in the client/server setting. In Proceedings of the Theory of Cryptography Conference. 2011.
A. Gupta, M. Hardt, A. Roth, and J. Ullman. Privately releasing conjunctions and the statistical query barrier. In Symposium on Theory of Computing ’11, pages 803–812. 2011.
A. Gupta, A. Roth, and J. Ullman. Iterative constructions and private data release. In Theory of Cryptography Conference, pages 339–356. 2012.
J. Håstad, R. Impagliazzo, L. Levin, and M. Luby. A pseudorandom generator from any one-way function. SIAM Journal of Computing, 28, 1999.
M. Hardt, K. Ligett, and F. McSherry. A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems 25, pages 2348–2356. 2012.
M. Hardt and A. Roth. Beating randomized response on incoherent matrices. In Proceedings of the Symposium on Theory of Computing, pages 1255–1268. Association for Computing Machinery, 2012.
M. Hardt and A. Roth. Beyond worst-case analysis in private singular vector computation. In Proceedings of the Symposium on Theory of Computing. 2013.
M. Hardt and G. N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In Foundations of Computer Science, pages 61–70. IEEE Computer Society, 2010.
M. Hardt and K. Talwar. On the geometry of differential privacy. In Proceedings of the Association for Computing Machinery Symposium on Theory of Computing, pages 705–714. Association for Computing Machinery, 2010.
N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J. Pearson, D. Stephan, S. Nelson, and D. Craig. Resolving individuals contributing trace amounts of dna to highly complex mixtures using highdensity snp genotyping microarrays. PLoS Genet, 4, 2008.
J. Hsu, Z. Huang, A. Roth, T. Roughgarden, and Z. S. Wu. Private matchings and allocations. arXiv preprint arXiv:1311.2828, 2013.
J. Hsu, A. Roth, and J. Ullman. Differential privacy for the analyst via private equilibrium computation. In Proceedings of the Association for Computing Machinery Symposium on Theory of Computing (STOC), pages 341–350, 2013.
Z. Huang and S. Kannan. The exponential mechanism for social welfare: Private, truthful, and nearly optimal. In IEEE Annual Symposium on the Foundations of Computer Science (FOCS), pages 140–149. 2012.
P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. Journal of Machine Learning Research — Proceedings Track, 23:24.1–24.34, 2012.
M. Kapralov and K. Talwar. On differentially private low rank approximation. In Sanjeev Khanna, editor, Symposium on Discrete Algorthims, pages 1395–1414. SIAM, 2013.
S. P. Kasiviswanathan, H. K. Lee, Kobbi Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
M.Kearns.Efficientnoise-tolerantlearningfromstatisticalqueries.Journal of the Association for Computing Machinery (JAssociation for Computing Machinery), 45(6):983–1006, 1998.
M. Kearns, M. Pai, A. Roth, and J. Ullman. Mechanism design in large games: Incentives and privacy. In Proceedings of the 5th conference on Innovations in theoretical computer science (ITCS), 2014.
D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-dimensional regression. Journal of Machine Learning Research, 1:41, 2012.
K. Ligett and A. Roth. Take it or leave it: Running a survey when privacy comes at a cost. In Internet and Network Economics, pages 378–391. Springer, 2012.
N. Littlestone and M. K. Warmuth. The weighted majority algorithm. In Annual Symposium on Foundations of Computer Science, 1989, pages 256–261. IEEE, 1989.
A. McGregor, I. Mironov, T. Pitassi, O. Reingold, K. Talwar, and S. P. Vadhan. The limits of two-party differential privacy. In Foundations of Computer Science, pages 81–90. IEEE Computer Society, 2010.
F. McSherry. Privacy integrated queries (codebase). Available on Microsoft Research downloads website. See also the Proceedings of SIGMOD 2009.
F. McSherry and K. Talwar. Mechanism design via differential privacy. In Foundations of Computer Science, pages 94–103. 2007.
F. McSherry and K. Talwar. Mechanism design via differential privacy. In Foundations of Computer Science, pages 94–103. 2007.
D. Mir, S. Muthukrishnan, A. Nikolov, and R. N. Wright. Pan-private algorithms via statistics on sketches. In Proceedings of the Association for Computing Machinery SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 37–48. Association for Computing Machinery, 2011.
I. Mironov. On significance of the least significant bits for differential privacy. In T. Yu, G. Danezis, and V. D. Gligor, editors, Association for Computing Machinery Conference on Computer and Communications Security, pages 650–661. Association for Computing Machinery, 2012.
I. Mironov, O. Pandey, O. Reingold, and S. P. Vadhan. Computational differential privacy. In Proceedings of CRYPTOLOGY, pages 126–142. 2009.
A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets (how to break anonymity of the netflix prize dataset). In Proceedings of IEEE Symposium on Security and Privacy. 2008.
A. Nikolov, K. Talwar, and L. Zhang. The geometry of differential privacy: the sparse and approximate cases. Symposium on Theory of Computing, 2013.
K. Nissim, C. Orlandi, and R. Smorodinsky. Privacy-aware mechanism design. In Association for Computing Machinery Conference on Electronic Commerce, pages 774–789. 2012.
K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Association for Computing Machinery Symposium on Theory of Computing, pages 75–84. 2007.
K. Nissim, R. Smorodinsky, and M. Tennenholtz. Approximately optimal mechanism design via differential privacy. In Innovations in Theoretical Computer Science, pages 203–213. 2012.
M.PaiandA.Roth.Privacyandmechanismdesign.SIGecomExchanges, 2013.
R. Rogers and A. Roth. Asymptotically truthful equilibrium selection in large congestion games. arXiv preprint arXiv:1311.2625, 2013.
A. Roth. Differential privacy and the fat-shattering dimension of linear queries. In Approximation, Randomization, and Combinatorial Optimization, Algorithms and Techniques, pages 683–695. Springer, 2010.
A. Roth. Buying private data at auction: the sensitive surveyor’s problem. Association for Computing Machinery SIGecom Exchanges, 11(1):1– 8, 2012.
A. Roth and T. Roughgarden. Interactive privacy via the median mechanism. In Symposium on Theory of Computing ’10, pages 765–774. 2010.
A. Roth and G. Schoenebeck. Conducting truthful surveys, cheaply. In Proceedings of the ACM Conference on Electronic Commerce, pages 826– 843. 2012.
B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft. Learning in a large function space: Privacy-preserving mechanisms for svm learning. arXiv preprint arXiv:0911.5708, 2009.
R. Schapire. The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
R. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 39:297–336, 1999.
R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
A. Smith and A. G. Thakurta. Differentially private feature selection via stability arguments, and the robustness of the lasso. In Proceedings of Conference on Learning Theory. 2013.
L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicines Ethics, 25:98–110, 1997.
J. Ullman. Answering n{2+o(1)} counting queries with differential privacy is hard. In D. Boneh, T. Roughgarden, and J. Feigenbaum, editors, Symposium on Theory of Computing, pages 361–370. Association for Computing Machinery, 2013.
L. G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27(11):1134–1142, 1984.
S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
D. Xiao. Is privacy compatible with truthfulness? In Proceedings of the Conference on Innovations in Theoretical Computer Science, pages 67–86. 2013.