15-859(B) Machine Learning Theory                                03/05/12
Guest Lecturer: Liu Yang

* Intro to Property Testing
* Three models of testing depending on the choice of query type (MQ / Random / Active queries)
  - Testing Linearity
  - Testing Intervals
  - Testing Dictators

========================================================================
Intro to Property Testing
========================================================================

In previous lectures, we have been studying PAC learning, where the goal is to find a classifier with error rate < eps.
- There is a related, easier problem where we just want to know whether the target fn is in the concept space or not (a preprocessing step for learning, to see whether you can learn with that concept space at all).
- Given the concept space C, we want an alg such that: if f is in C, the alg takes samples, makes queries, and says "accept"; if f is far away from every classifier in C, the alg takes samples, makes queries, and says "reject".
- There is a gray area in between: if f is not in C but is eps-close to C, the alg may output an arbitrary decision.

A bit of history: property testing initially appeared as a tool for program checking (for example, the linearity test [BLR]). Say a piece of software f is supposed to compute some linear function and we want to identify bugs. Rather than querying its value at all 2^n points in order to be sure that f is perfect, if we find no mistakes among O(1/eps) queries, we can be convinced that f is eps-close to being linear. [GGR] defined property testing in relation to PAC learning, and property testing became a new type of computational problem.

========================================================================
Relation to Learning
========================================================================

Learning outputs a good approximation of the target fn. Testing decides whether the target f is in the concept space or eps-far from every fn in the concept space.
- Testing can serve as a preliminary step of learning (one can view it as a model selection step, deciding whether to use a class as the hypothesis class).
- We are also interested in testing algs because they can be more efficient than learning. Roughly speaking, testing is easier than learning. Provably, testing is no harder than proper learning.
We want testing algs whose query complexity is sub-linear in the size of the object (here, Boolean fns); at best, it is independent of the size of the object.

----------------------------------------------------------------
The standard model (using Membership Queries)
----------------------------------------------------------------

A (standard) testing alg for property P (of fns from X to R) is given a distance parameter eps and query access to an unknown fn f: X -> R.
- If f is in P, then the alg should accept with prob at least 2/3.
- If dist(f, P) >= eps, then the alg should reject with prob at least 2/3.
Here dist_D(f, P) = min_{g in P} dist_D(f, g) and dist_D(f, g) = Pr_{x ~ D}[f(x) != g(x)].

Depending on the choice of query type:
- Membership query: ask for the value of the fn at an arbitrary point of your choosing.
- Active query: selectively (sequentially) pick points from a natural pool of unlabeled examples and query their labels.
- Passive: randomly sample examples from the pool (and receive their labels).
we obtain two more variants of the standard model: Passive Testing and Active Testing.

In this lecture, we will go through a few fn classes, construct testers (upper bounds), and prove a lower bound for each.

----------------------------------------------------------------

Theorem: If a fn class F has a proper learning alg L (an alg that outputs a hypothesis from the same class of fns as the target), then F has a property testing alg T with sample complexity m_T(n, eps) = m_L(n, eps/2) + O(1/eps).

Proof Sketch: (Efficient proper learner => efficient tester; it is not necessarily true that an efficient improper learner gives an efficient tester, because it might be NP-hard to project an improper hypothesis back onto the class.)
Run the proper learner with accuracy parameter eps/2; it outputs a classifier h in F. Use a validation set of size O(1/eps) to estimate dist(h, f):
- case 1: if the validation estimate indicates dist(h, f) < (3/4)*eps, accept;
- case 2: if the validation estimate indicates dist(h, f) > (3/4)*eps, reject.
(If dist(h, f) < eps/2, then O(1/eps) samples suffice to confirm dist(h, f) < (3/4)*eps by a Chernoff bound; if dist(h, f) > eps, then O(1/eps) samples suffice to confirm dist(h, f) > (3/4)*eps.) In the remaining cases, the output may be arbitrary.
If f is in F, then since L is a learning alg, dist(h, f) < eps/2, and the tester accepts w.h.p.
If f is eps-far from F, then since L is proper, h is in F and hence dist(h, f) >= eps, and the tester rejects w.h.p.
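To make the reduction concrete, here is a minimal Python sketch of a tester built from a proper learner. The interfaces f, proper_learn, sample, the argument m_L, and the validation-set constant are illustrative assumptions, not part of the notes.

def test_via_proper_learning(f, proper_learn, sample, m_L, eps):
    # Sketch of the reduction from testing to proper learning (assumed interfaces):
    #   f(x)                 -> label of the target fn at x
    #   proper_learn(S, eps) -> hypothesis h from the class F, trained on labeled set S
    #   sample()             -> one unlabeled example drawn from D
    #   m_L                  -> sample size the learner needs for accuracy eps/2
    # Step 1: run the proper learner to accuracy eps/2.
    train = [(x, f(x)) for x in (sample() for _ in range(m_L))]
    h = proper_learn(train, eps / 2)
    # Step 2: estimate dist(h, f) on O(1/eps) fresh validation examples.
    m_val = int(48 / eps)  # the constant 48 is an arbitrary illustrative choice
    disagreements = 0
    for _ in range(m_val):
        x = sample()
        if h(x) != f(x):
            disagreements += 1
    # Accept iff the empirical distance falls below the (3/4)*eps threshold.
    return disagreements / m_val < 0.75 * eps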
========================================================================
Testing Linearity (BLR tester, MQ, Uniform distribution)
========================================================================

Claim 1: A fn f is linear iff for every x, y in {0, 1}^n, f(x) + f(y) = f((x + y) mod 2).

The BLR tester repeats the following basic iteration: pick x, y in {0, 1}^n uniformly at random, query f(x), f(y), and f(x + y), and reject if f(x) + f(y) != f(x + y); if no iteration rejects, accept. Let
    eta(f) = Pr_{x, y}[f(x) + f(y) != f(x + y)]
be the prob that a single iteration detects that f is not linear, and let
    eps(f) = min_{g in L} dist(f, g)
where L is the set of all linear fns. We directly use the following lemma:

Lemma 1: eta(f) >= eps(f)/c for some constant c > 0.

(Define the majority fn g as follows: for each x in the n-dim Boolean cube, g(x) = 0 if Pr_y[f(x + y) - f(y) = 0] >= 1/2, and g(x) = 1 otherwise. Define the vote that y casts on the value of x as V_y(x) = f(x + y) - f(y). The proof uses the technique of self-correction: f can be self-corrected assuming it is sufficiently close to being linear, i.e., for any x of our choice, if we want to know the value of the closest linear fn on x, we select y1, ..., yt uniformly at random and take the majority vote of V_{y1}(x), ..., V_{yt}(x). The right choice of t determines the probability that the majority is correct.)

[Proof sketch] To show that for every given eps > 0, if eps(f) >= eps, then Pr(reject) is larger than 2/3: by Lemma 1, eta(f) >= eps(f)/c >= eps/c, so with the number of iterations set to 2c/eps,
    Pr(reject) >= 1 - (1 - eta(f))^{2c/eps} >= 1 - e^{-2 c eta(f)/eps} >= 1 - e^{-2} >= 2/3.
Conversely, if f is linear, then the test accepts with probability 1 (by Claim 1). QED
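A minimal Python sketch of the BLR tester over the uniform distribution on {0, 1}^n, with membership-query access to f; the oracle interface and the value of the constant c are illustrative assumptions.

import random

def blr_linearity_test(f, n, eps, c=10):
    # BLR linearity tester (sketch). f is a membership-query oracle mapping a
    # length-n 0/1 tuple to {0, 1}; c stands in for the constant from Lemma 1.
    rounds = int(2 * c / eps) + 1
    for _ in range(rounds):
        x = tuple(random.randint(0, 1) for _ in range(n))
        y = tuple(random.randint(0, 1) for _ in range(n))
        x_plus_y = tuple((a + b) % 2 for a, b in zip(x, y))
        # A single violated identity f(x) + f(y) = f(x + y) proves f is not linear.
        if (f(x) + f(y)) % 2 != f(x_plus_y):
            return False  # reject
    return True  # accept: no violation found among O(1/eps) random checks

A parity such as f(x) = (x[0] + x[3]) % 2 always passes, while a fn that is eps-far from every linear fn is rejected w.p. >= 2/3.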
========================================================================
Testing Unions of d Intervals (Active tester, Uniform distribution)
========================================================================

Def: Fix delta > 0. The local delta-noise sensitivity of the fn f: [0, 1] -> {0, 1} at x in [0, 1] is NS_delta(f, x) = Pr_{y ~_delta x}[f(x) != f(y)], where y ~_delta x represents a draw of y uniform in (x - delta, x + delta) \cap [0, 1]. The noise sensitivity of f is NS_delta(f) = Pr_{x, y ~_delta x}[f(x) != f(y)], where x is uniform in [0, 1].

A simple lemma says that a union of d intervals has low noise sensitivity.

Lemma 2 (easy): Fix delta > 0 and let f: [0, 1] -> {0, 1} be a union of d intervals. Then NS_delta(f) <= d*delta.

[Proof sketch] We want to bound Pr(y is on the other side of some interval boundary from x). We note that
- if there is no boundary within delta of x, there is no way y can have a different label;
- x has to be within distance delta of one of the (at most) 2d boundaries for f(x) != f(y) to be possible.
Pick a boundary b and let z be the distance from x to b (z varies from 0 to delta). Since y is uniform in the length-2*delta neighborhood of x, the prob that y falls on the other side of b is (delta - z)/(2*delta), and integrating over the position of x on one side of b gives
    \int_0^{delta} (delta - z)/(2*delta) dz = delta/2 - delta/4 = delta/4.
Similarly for the other side of b we get another delta/4, so each boundary contributes at most delta/2. Since there are at most 2d boundaries, the union bound gives Pr(y is on the other side of some boundary from x) <= d*delta. QED

Lemma 3 (hard): Fix delta = eps^2/(32*d). Let f: [0, 1] -> {0, 1} be a fn with noise sensitivity bounded by NS_delta(f) <= d*delta*(1 + eps/4). Then f is eps-close to a union of d intervals. (The proof is somewhat involved, so we skip it here.)

Given that Lemmas 2 and 3 are true, here is a tester that works.

Union of Intervals Tester(f, d, eps)
Parameters: delta = eps^2/(32*d), r = O(eps^{-3}).
1. For round i = 1, ..., r:
   - draw x in [0, 1] uniformly at random;
   - draw unlabeled samples until we get y in (x - delta, x + delta);
   - set Z_i = 1[f(x) != f(y)].
2. Accept iff (1/r) \sum_i Z_i <= d*delta*(1 + eps/8).

The tester makes O(eps^{-3}) label queries.

[Correctness proof idea] By a Chernoff bound, (1/r) \sum_i Z_i is a good estimate of the noise sensitivity of the target fn, up to the required factor. If the true noise sensitivity (call it NS*) is <= d*delta (as holds for a union of d intervals, by Lemma 2), then the estimate hat(NS) is <= d*delta*(1 + eps/8), so we accept. Otherwise, if NS* > d*delta*(1 + eps/4) (which by Lemma 3 must hold when f is eps-far from every union of d intervals), then hat(NS) > d*delta*(1 + eps/8), so we reject.
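A Python sketch of the tester above. The rejection-sampling loop that waits for a fresh uniform draw to fall in (x - delta, x + delta) stands in for picking a nearby pool point (the active queries); the constant hidden in r = O(eps^{-3}) is an illustrative guess.

import random

def union_of_intervals_test(f, d, eps, r_const=100):
    # Active tester for unions of d intervals on [0, 1] (sketch).
    # f is a label oracle on pool points; the pool is modeled by Uniform[0,1] draws.
    delta = eps ** 2 / (32 * d)
    r = int(r_const * eps ** -3)               # r = O(eps^{-3}) rounds
    total = 0
    for _ in range(r):
        x = random.random()                    # draw x uniformly from [0, 1]
        y = random.random()                    # keep drawing pool points until one
        while not abs(y - x) < delta:          # lands in (x - delta, x + delta)
            y = random.random()
        total += 1 if f(x) != f(y) else 0      # Z_i estimates the noise sensitivity
    # Accept iff the empirical noise sensitivity is at most d*delta*(1 + eps/8).
    return total / r <= d * delta * (1 + eps / 8)

Example target: f = indicator of [0.1, 0.3] U [0.6, 0.9], a union of d = 2 intervals, which the tester accepts w.h.p.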
========================================================================
Testing Dictator fns
========================================================================

A dictator fn is a fn f: {0, 1}^n -> {0, 1} such that f(x) = x_i for some i in [n].

--------------------------------------------------------------------------------------
- Dictator is locally testable
--------------------------------------------------------------------------------------

Consider an election with n voters and 3 candidates A, B, and C, where each voter submits one bit for each pairwise comparison expressing his preference (say "1" means he prefers A to B in the comparison A vs. B):
(1) A > B   (2) B > C   (3) C > A
The aggregated preference is represented by (f(s1), f(s2), f(s3)), where s_j is the vector of the n voters' bits for comparison (j). Say a triple (a, b, c) in {-1, 1}^3 is rational if it corresponds to a non-cyclic ordering. If f is the majority fn, we can get an irrational outcome, in which the three aggregated bits are either all 1 or all -1 (Condorcet's paradox).

Define NAE: {-1, 1}^3 -> {true, false} to be the predicate that is true iff its three input bits are not all equal. The NAE test:
- Choose x, y, z in {-1, 1}^n by choosing each (x_i, y_i, z_i) uniformly from the set of 6 assignments s.t. the three bits are not all equal, independently across the i's.
- Accept if NAE(f(x), f(y), f(z)), and otherwise reject.

Theorem: Dictator is locally testable.

Proof: We run both the BLR test and the NAE test on f, and accept iff both tests accept. Clearly, if f is a dictator, it passes both tests with prob 1: a dictator is a parity, so it passes BLR, and if f(x) = x_i then (f(x), f(y), f(z)) = (x_i, y_i, z_i) is not all equal by construction, so it passes NAE. Suppose the overall test passes with prob >= 1 - eps. Then the BLR test passes w.p. >= 1 - eps, and so (via Parseval) there exists a unique set S* s.t. hat(f)(S*) >= 1 - 2*eps, assuming eps is small. Also, the NAE test passes w.p. at least 1 - eps, so \sum_{|S| = 1} hat(f)(S)^2 >= 1 - (9/2)*eps. Thus |S*| = 1, since otherwise \sum_S hat(f)(S)^2 >= 1 - (9/2)*eps + (1 - 2*eps)^2 > 1, a contradiction. Thus the correlation of f with the i-th coordinate (where S* = {i}) is at least 1 - 2*eps, and so f is eps-close to a dictator. QED
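A sketch combining the two tests, written in 0/1 notation for consistency with the BLR code above (the notes state the NAE test in +-1 notation); the number of rounds and the oracle interface are illustrative assumptions.

import random

# The 6 not-all-equal triples (the analysis above uses +-1, here written over {0, 1}).
NAE_TRIPLES = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)
               if not (a == b == c)]

def nae_round(f, n):
    # One round of the NAE test: pick (x_i, y_i, z_i) u.a.r. among the
    # not-all-equal triples, independently for each coordinate i.
    cols = [random.choice(NAE_TRIPLES) for _ in range(n)]
    x = tuple(c[0] for c in cols)
    y = tuple(c[1] for c in cols)
    z = tuple(c[2] for c in cols)
    outs = (f(x), f(y), f(z))
    return not (outs[0] == outs[1] == outs[2])   # pass iff the outputs are not all equal

def blr_round(f, n):
    # One round of the BLR test.
    x = tuple(random.randint(0, 1) for _ in range(n))
    y = tuple(random.randint(0, 1) for _ in range(n))
    x_plus_y = tuple((a + b) % 2 for a, b in zip(x, y))
    return (f(x) + f(y)) % 2 == f(x_plus_y)

def dictator_test(f, n, rounds=200):
    # Dictatorship test sketch: accept iff every BLR round and every NAE round passes.
    # The number of rounds is an illustrative choice (it controls the eps achieved).
    return all(blr_round(f, n) and nae_round(f, n) for _ in range(rounds))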
--------------------------------------------------------------------------------------
- Lower bound: Omega(log n) for passive/active testing of dictator fns
--------------------------------------------------------------------------------------

Theorem: For testing algs with access to poly(n) unlabeled examples, the number of label requests needed for actively testing dictator fns is Theta(log n). The lower bound applies even for distinguishing dictators from random fns.

Remark: This result has deep implications in the sense that any fn class that contains the dictators as a special case (and is not so large as to contain almost all fns) requires at least Omega(log n) queries to test in the passive/active model. This applies to many classes studied in learning: LTFs, decision trees, fns of low Fourier degree, k-juntas, DNFs.

[Proof idea] Consider K iid labeled examples where the target is sampled from one of two priors:
- case 1: uniform among the dictators;
- case 2: random noise (a uniformly random fn).
We want to show that the total variation distance between the label distributions induced by the two cases is at most eps w.p. at least 1 - c1*e^{c2(K - 2^{-K} n eps^2)} over the draw of the examples (Lemma 5 below); so if K = o(log n), this probability goes to 1 and the two cases are indistinguishable. For an active tester, a union bound finishes the argument: given poly(n) unlabeled examples, an active tester making K (possibly adaptive) label queries has its behavior determined by the labels of at most 2^K pool points; so if this total variation distance is small for all the tuples of points the tester might query, active testing cannot be effective. The union bound implies this happens w.p. at least 1 - (poly(n) choose 2^K)*c1*e^{c2(K - 2^{-K} n eps^2)}; if K = o(log n), this probability again goes to 1.

[We didn't have time to cover this in class, but here is the formal proof of the lower bound.]

Let D be the uniform distribution over X = {0, 1}^n and let Z = {x1, x2, ..., x_{poly(n)}} be a dataset sampled iid from D. We will show that with o(log n) labels, one cannot even distinguish a random dictator fn from a uniformly random fn. Specifically, consider a prior h1 ~ Unif({e_1, e_2, ..., e_n}), uniform over the n single-variable (dictator) fns, and a prior h2 uniform over all 2^{2^n} Boolean fns. Below, for a (random) vector x = (x_1, ..., x_K) in X^K, we denote by P_{h_i(x)} the conditional distribution of the label vector h_i(x) in {0, 1}^K given x, and we write ||P1 - P2|| for the variation distance. The following lemma states that for K = o(log n) uniformly drawn random examples, with high probability all labelings are nearly equally likely under the prior h1.

Lemma 4: For any natural number K and any eps in (0, 1), for x ~ D^K, w.p. at least 1 - c1*e^{c2(K - 2^{-K} n eps^2)},
    \forall y in {0, 1}^K,  (1 - 2*eps)*2^{-K} <= P_{h1(x)}({y}) <= (1 + 2*eps)*2^{-K}   [call this eqn (*)]
where c1 > 0 and c2 > 0 are constants independent of K.

[Proof of Lemma 4] For given K data points x and any y in {0, 1}^K, for each j in {1, 2, ..., n}, let I_j(x, y) be the indicator that x_{ij} = y_i for all i; in other words, that the j-th "column" of the examples matches the vector y. Note that P_{h1(x)}({y}) = n^{-1} \sum_{j=1}^n I_j(x, y). Fixing y, since the values {x_{ij}}_{i,j} are independent, the indicators I_j(x, y) are independent over values of j as well, and each has expectation 2^{-K}. By a Chernoff bound,
    P(\sum_{j=1}^n I_j(x, y) > (1 + 2*eps)*2^{-K}*n) < exp{-2^{-K}*n*4*eps^2/3},
    P(\sum_{j=1}^n I_j(x, y) < (1 - 2*eps)*2^{-K}*n) < exp{-2^{-K}*n*2*eps^2}.
The lemma then follows by a union bound over all 2^K choices of y. QED

Lemma 5: For any natural number K and any eps in (0, 1),
    Pr(x ~ D^K: ||P_{h1(x)} - P_{h2(x)}|| > eps) <= c1*e^{c2(K - 2^{-K} n eps^2)}
where c1 > 0 and c2 > 0 are constants independent of K.

[Proof of Lemma 5] Assume x satisfies eqn (*). Since h2 is a uniformly random fn, P_{h2(x)}({y}) = 2^{-K} for every y (the points x_1, ..., x_K are distinct with overwhelming probability). We have
    ||P_{h1(x)} - P_{h2(x)}|| = (1/2) \sum_{y in {0,1}^K} |P_{h1(x)}({y}) - P_{h2(x)}({y})|
                              = (1/2) \sum_{y in {0,1}^K} |n^{-1} \sum_{j=1}^n I_j(x, y) - 2^{-K}|
                              <= (1/2) \sum_{y in {0,1}^K} 2*eps*2^{-K} = eps.   QED
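A small simulation sketch of Lemmas 4 and 5: it draws K uniform points, computes P_{h1(x)}({y}) = n^{-1} \sum_j I_j(x, y) for every y, and reports the variation distance to the uniform distribution 2^{-K}; the parameter values are arbitrary illustrative choices.

import itertools
import random

def dictator_vs_random_tv(n=100000, K=3):
    # Empirical illustration of Lemmas 4 and 5 (sketch).
    # Draw x = (x_1, ..., x_K) iid uniform over {0,1}^n, stored as a K x n 0/1 matrix.
    x = [[random.randint(0, 1) for _ in range(n)] for _ in range(K)]
    tv = 0.0
    for y in itertools.product((0, 1), repeat=K):
        # I_j(x, y): the j-th column of the example matrix equals y.
        matches = sum(1 for j in range(n) if all(x[i][j] == y[i] for i in range(K)))
        p_h1 = matches / n      # P_{h1(x)}({y}): prob of label vector y under a random dictator
        p_h2 = 2.0 ** -K        # under a uniformly random fn (the x_i are distinct w.h.p.)
        tv += abs(p_h1 - p_h2)
    return tv / 2               # variation distance ||P_{h1(x)} - P_{h2(x)}||

With n = 100000 and K = 3 the returned distance is typically well below 0.05, matching the lemmas' message that K = o(log n) labels cannot separate the two priors.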
Theorem: The sample complexity of passive testing of dictator fns against random noise is Omega(log n).

[Proof] Suppose we have an alg based on a number of examples bounded by some K. Assume Pr(reject h2) >= 1/2, i.e., with probability at least 1/2 the alg rejects random noise (required in order to be a correct alg). Lemma 5 indicates that setting the label of any point x_i (i <= K) randomly is not too different from setting it according to h1. Specifically,
    Pr(x ~ D^K: ||P_{h1(x)} - P_{h2(x)}|| > eps) <= c1*e^{c2(K - 2^{-K} n eps^2)}.
For any K < log(n eps^2 / log(n)) = Omega(log(n)), this converges to 0 as n goes to infinity. In particular, since convergence in probability implies convergence in expectation for bounded random variables (the variation distance is at most 1), this implies lim_{n -> infty} E[||P_{h1(x)} - P_{h2(x)}||] = 0. Thus
    lim_{n -> infty} Pr(reject h1) >= lim_{n -> infty} Pr(reject h2) - lim_{n -> infty} E[||P_{h1(x)} - P_{h2(x)}||] >= 1/2.
This implies that for any alg with Pr(reject h2) >= 1/2 based on K = o(log n) examples, for sufficiently large n there exists a classifier h among {e_1, ..., e_n} \subset C such that Pr(reject h) > 1/4, so it is not a correct testing alg. QED

We now show that in active testing as well, o(log n) label queries do not suffice as n goes to infinity. For our purposes, a testing alg is "correct" if it accepts any h in C with probability > 3/4 and rejects random noise with probability > 3/4.

Lemma 6: For sufficiently large n and a value K = Omega(sqrt(n/log(n))), w.p. at least 5/6 (over X), ||P_{(z/sqrt(n))|X} - N(0, I)|| < 1/5.

Theorem: For testing algs with access to poly(n) unlabeled examples, the number of label requests needed for actively testing dictator fns is Theta(log n). The lower bound applies even for distinguishing dictators from random fns.

[Proof] Suppose we have an alg whose number of queries is bounded by some K. Assume Pr(reject h2) >= 1/2, i.e., with prob >= 1/2 the alg rejects random noise (required in order to be a correct alg). Lemma 4 indicates that setting the label of any queried point x_i (i <= K) randomly is not too different from setting it according to h1. Specifically, with probability at least 1 - c1*e^{c2(K - 2^{-K} n eps^2)},
    \forall y in {0, 1}^K,  (1 - 2*eps)*P_{h2(x)}({y}) = (1 - 2*eps)*2^{-K} <= P_{h1(x)}({y}) <= (1 + 2*eps)*2^{-K} = (1 + 2*eps)*P_{h2(x)}({y}).
There are at most (poly(n))^K possible sequences of K queries out of the poly(n) points. By a union bound, w.p. at least
    1 - c1*poly(n)^K*exp(c2(K - 2^{-K} n eps^2)) >= 1 - exp(c3*K*log(n) - c4*2^{-K}*n*eps^2),
for some appropriate c3, c4 >= 1, we have
    \forall (x_{i1}, ..., x_{iK}) in Z^K, \forall y in {0, 1}^K,  P_{h1(x)}({y}) <= (1 + 2*eps)*P_{h2(x)}({y}).
For K < c3^{-1} log(c4 n eps^2 / log^2(c4 n)) = Omega(log(n)), exp(c3*K*log(n) - c4*2^{-K}*n*eps^2) converges to 0 as n goes to infinity. In particular, taking eps = 1/10 and any delta > 0, for any sufficiently large n, with probability at least 1 - delta,
    \forall (x_{i1}, ..., x_{iK}) in Z^K, \forall y in {0, 1}^K,  P_{h1(x)}({y}) <= (6/5)*P_{h2(x)}({y}).
Now, taking delta sufficiently small and applying a slight generalization of Lem. 8.3 of [Fischer, 2001], which is widely used for proving lower bounds in property testing, any tester distinguishing a randomly drawn dictator from the random noise fn with success probability greater than 3/4 must use more than K queries. QED