15-853: Algorithms in the Real World
Guy Blelloch, Carnegie Mellon University

Indexing and Searching III
- Link Analysis
- Near duplicate removal

Indexing and Searching Outline
- Introduction: model, query types
- Inverted Indices: Compression, Lexicon, Merging
- Vector Models: Weights, cosine distance
- Latent Semantic Indexing
- Link Analysis: PageRank (Google), HITS
- Near Duplicate Removal
Link Analysis
The goal is to rank pages. We want to take advantage of the link structure to do this.
Two main approaches:
- Static: use the links to calculate a ranking of the pages offline (Google).
- Dynamic: use the links in the results of a search to dynamically determine a ranking (Clever: hubs and authorities).

The Link Graph
View documents as graph nodes and the hyperlinks between documents as directed edges.
Can give weights on edges (links) based on:
- position in the document
- weight of the anchor term
- number of occurrences of the link
- ...

The Link Matrix
An adjacency matrix of the link graph.
- Each row corresponds to the out-links of a vertex.
- Each column corresponds to the in-links of a vertex.
- Often normalized, e.g. so that each column sums to 1, or to unit length so that the dot product of a column with itself is 1.
The Link Matrix
[Figure: example link graph and its matrix]
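A small sketch of building the link matrix, assuming a hypothetical four-page graph (the page numbering and the NumPy representation are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical 4-page graph: out_links[i] lists the pages that page i links to.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}

n = 4
T = np.zeros((n, n))
for src, dsts in out_links.items():
    for dst in dsts:
        T[src, dst] = 1.0   # row src = out-links of src; column dst = in-links of dst

# Normalize each nonzero column to sum to 1.
col_sums = T.sum(axis=0)
A = T / np.where(col_sums == 0, 1.0, col_sums)

print(A.sum(axis=0))   # nonzero columns now sum to 1
```

Page 3 has no in-links here, so its column stays all zero; dangling columns like this are exactly why PageRank later adds a random jump.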
Google: Page Rank (1)
Goal: give a total order (rank) on the importance of all pages they index.
Possible method: count the number of in-links.
Problems?
Google: Page Rank (2)
Refined method: weight each source by its importance. This seems like a self-recursive definition.
Imagine the following: Jane Surfer has a lot of time to waste, and randomly selects an outgoing link each time she gets to a page. She counts how often she visits each page.

Side Topic: Markov Chains
A discrete-time stochastic process is a sequence of random variables {X0, X1, ..., Xn, ...} where 0, 1, ..., n, ... are discrete points in time.
A Markov chain is a discrete-time stochastic process defined over a finite (or countably infinite) set of states S in terms of a matrix P of transition probabilities.
Memorylessness property: for a Markov chain,
  Pr[Xt+1 = j | X0 = i0, X1 = i1, ..., Xt = i] = Pr[Xt+1 = j | Xt = i]
Side Topic: Markov Chains
Let pi(t) be the probability of being in state i at time step t, and let p(t) = [p0(t), p1(t), ...] be the vector of probabilities at time t.
For an initial probability distribution p(0), the probabilities at time n are
  p(n) = p(0) P^n
A probability distribution p is stationary if p = p P (i.e. p is a left eigenvector of P with eigenvalue 1).
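To make p(n) = p(0) P^n and stationarity concrete, here is a tiny sketch with a hypothetical two-state chain (the transition probabilities are made up for illustration):

```python
import numpy as np

# Hypothetical 2-state chain; row i holds the transition probabilities out of state i.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

p = np.array([1.0, 0.0])      # initial distribution p(0)
for _ in range(100):          # p(n) = p(0) P^n, via repeated right-multiplication
    p = p @ P

# This chain is irreducible and aperiodic, so p converges to the unique
# stationary distribution; solving p = p P by hand gives (5/6, 1/6).
print(p)
```

Starting from any other p(0) gives the same limit, which is the content of the Fundamental Theorem on the next slide.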
Side Topic: Markov Chains
A Markov chain is irreducible if the underlying graph (of P) consists of a single strong component (i.e. all states are reachable from all other states).
The period of a state i is the integer t such that P^n(i,i) = 0 for all n not in {t, 2t, 3t, ...} (i.e. the chain can return to i only at multiples of t steps). A state is aperiodic if it has period 1, and a Markov chain is aperiodic if all states are aperiodic.
A state is recurrent if, upon entering this state, the process will definitely return to the state.
Side Topic: Markov Chains
Fundamental Theorem of Markov Chains: any irreducible, finite, and aperiodic Markov chain has the following properties:
- All states are ergodic (aperiodic and recurrent).
- There is a unique stationary distribution p, with pi > 0 for all states i.
- Let N(i,t) be the number of times the Markov chain visits state i in t steps. Then lim_{t -> infinity} N(i,t) / t = pi.
Google: Page Rank (3)
Remaining problem: this gives too high a rank to the Yahoos of the world, and too low a rank to clusters with low connectivity into the cluster.
Solution: with some probability, jump to a random page instead of a random outgoing link. The page ranks are then the stationary probabilities of a Markov chain whose transition matrix mixes the normalized link matrix A (each column sums to 1) with the uniform matrix U.

Google: Page Rank (4)
Want to find p such that T p = p.
This corresponds to finding the principal eigenvector. Methods:
- Power method: iterate p(i+1) = T p(i) until it settles. Pick p(0) randomly; decomposing p(0) into eigenvectors as a1 e1 + a2 e2 + ..., we have p(i) = l1^i a1 e1 + l2^i a2 e2 + ...
- Multilevel method: solve on a contracted graph, then use the result as the initial vector for the power method on the expanded graph.
- Lanczos method: diagonalize, then iterate.
All methods are quite expensive.
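Combining the random-jump fix with the power method gives the usual PageRank iteration. The sketch below iterates T = (1 - eps) A + eps U without materializing U, where A is the column-normalized link matrix and U is the uniform matrix; eps = 0.15 and the 3-page graph are hypothetical choices, not from the slides:

```python
import numpy as np

def pagerank(A, eps=0.15, iters=100):
    """Power method on T = (1 - eps) * A + eps * U, where A has columns
    summing to 1 and U is the uniform matrix (U @ p has entries 1/n).
    eps is an assumed jump probability, not specified in the slides."""
    n = A.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p = (1 - eps) * (A @ p) + eps / n   # jump to a random page w.p. eps
    return p

# Hypothetical 3-page cycle 0 -> 1 -> 2 -> 0; A[i, j] = 1 if j links to i.
# By symmetry all ranks should come out equal to 1/3.
A = np.array([[0., 0., 1.],
              [1., 0., 0.],
              [0., 1., 0.]])
print(pagerank(A))
```

Each iteration costs one sparse matrix-vector product, which is why the power method is the practical choice among the (all expensive) options above.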
Hubs and Authorities (1)
Goal: get a ranking for a particular query (instead of the whole web).
Assume: we have a search engine that can give a set of pages P that match a query. Find all pages that point to P (the back set) and all pages that P points to (the forward set), and include the back, query-result, and forward sets in a document set D.

Hubs and Authorities (2)
For a particular query we might want to find:
- the authorities on the topic (where the actual information is kept), and
- the hub sites that point to the best information on the topic (typically some kind of link site).
Important authorities will have many pages that point to them; important hubs will point to many pages.
But: as with PageRank, just having a lot of pointers does not mean much.

Hubs and Authorities (3)
The vector h gives a weight for each document in D for how good a hub it is.
The vector a gives a weight for each document in D for how good an authority it is.
T is the adjacency matrix. Now repeat:
  h = T^T a
  a = T h
until it settles.

Hubs and Authorities (4)
What is this process doing?
  a(i+1) = (T T^T) a(i)
  h(i+1) = (T^T T) h(i)
This is the power method for the principal eigenvectors of T T^T and T^T T.
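The h = T^T a, a = T h iteration can be sketched directly. One assumption made explicit here: T[i, j] = 1 when page j links to page i, which is the orientation that makes the slides' updates give hub and authority semantics; the vectors are normalized each round so the iteration settles.

```python
import numpy as np

def hits(T, iters=50):
    """Power iteration h = T^T a, a = T h, normalizing each round.
    Assumed convention: T[i, j] = 1 iff page j links to page i."""
    n = T.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        h = T.T @ a                 # hub score = sum of authority scores pointed to
        a = T @ h                   # authority score = sum of hub scores pointing in
        h /= np.linalg.norm(h)
        a /= np.linalg.norm(a)
    return h, a

# Hypothetical query graph: pages 0 and 1 (hubs) both link to pages 2 and 3.
T = np.zeros((4, 4))
T[2, 0] = T[2, 1] = T[3, 0] = T[3, 1] = 1.0
h, a = hits(T)
print(h, a)   # hub weight concentrates on 0, 1; authority weight on 2, 3
```

On this graph the iteration converges in one step: all hub weight lands on pages 0 and 1 and all authority weight on pages 2 and 3, matching the eigenvector view above.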
Equivalently, SVD(T): the first column of U and the first row of V.

Hubs and Authorities (5)
Problem (topic drift): hubs and authorities will tend to drift toward the general instead of the specific.
E.g. you search for "indexing algorithms", which identifies all pages with those terms in them, plus the back and forward pages; now you use the SVD to find hubs and authorities. These will most likely be pages on algorithms in general, rather than on indexing algorithms.
Solution: weight the documents with "indexing" prominently in them more heavily:
  a = T W h,   h = T^T W a
where W is a diagonal matrix with the weights on the diagonal. This is the same as finding the SVD of T' = T W.

Extensions
- What about second eigenvectors?
- Mix terms and links.

Indexing and Searching Outline
- Introduction: model, query types
- Inverted Indices: Compression, Lexicon, Merging
- Vector Models: Weights, cosine distance
- Latent Semantic Indexing
- Link Analysis: PageRank (Google), HITS
- Near Duplicate Removal

Syntactic Clustering of the Web
Problem: many pages on the web are almost the same, but not exactly the same:
- Mirror sites with local identifiers and/or local links.
- Spam sites that try to improve their PageRank.
- Plagiarized sites.
- Revisions of the same document.
- Program documentation converted to HTML.
- Legal documents.
These near-duplicates cause web indexing sites a huge headache.

Syntactic Clustering
Goal: find pages for which either
- most of the content is the same, or
- one is contained in the other (possibly with changes).
Constraints:
- Need to do it with only a small amount of information per document: a sketch.
- Comparison of documents should take time linear in the size of the sketch.
- Should be reasonably robust against an adversary.

Syntactic Clustering
Any ideas? E.g. keep the first 100 shingles? Shingles from preselected random positions (say 23, 786, 1121, 1456, ...)? Shingles from every kth position?

Shingling
View each shingle of a document as a value. (A w-shingle is a contiguous subsequence of w tokens of the document.)
S_w(A): the set of w-shingles for document A.
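A sketch of computing S_w(A), assuming word-level shingles (the slides leave the tokenization unit open; characters would work the same way):

```python
def shingles(text, w=3):
    """S_w(A): the set of contiguous w-word sequences of a document."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

# "a rose is a rose is a rose" has only 3 distinct 3-shingles:
# (a, rose, is), (rose, is, a), (is, a, rose)
S = shingles("a rose is a rose is a rose")
print(len(S))   # -> 3
```

Because S_w(A) is a set, repeated passages collapse, which is what makes shingle sets a useful basis for comparing documents.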
As shorthand, I will often drop the w subscript.

Resemblance and Containment
  Resemblance:  r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
  Containment:  c(A, B) = |S(A) ∩ S(B)| / |S(A)|

Calculating Resemblance/Containment
Method 1:
Keep all the shingles and calculate the union and intersection of the shingle sets directly.
Problem: this can take |D| * w space per document (larger than the original document).
Other ideas?
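Method 1 is direct set arithmetic. A sketch using the standard (Broder) definitions of resemblance and containment, over two hypothetical shingle sets:

```python
def resemblance(sa, sb):
    # r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
    return len(sa & sb) / len(sa | sb)

def containment(sa, sb):
    # c(A, B) = |S(A) ∩ S(B)| / |S(A)|
    return len(sa & sb) / len(sa)

SA = {1, 2, 3, 4}
SB = {3, 4, 5, 6}
print(resemblance(SA, SB))   # 2 shared / 6 total  = 1/3
print(containment(SA, SB))   # 2 shared / 4 in SA  = 1/2
```

The space problem is exactly that SA and SB here stand for the full shingle sets; the rest of the lecture is about replacing them with small sketches.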
Using Min-Valued Shingles
How about viewing each shingle as a number (e.g. by appending the ASCII codes) and selecting the 100 minimum-valued shingles?
Possible problem? (The minimum values might correspond to common idioms, e.g. a bunch of blank spaces.)

Hash
How about a random hash? I.e. for each shingle x take h(x), then pick the minimum k values:
  S_k(A) = min_k { h(x) : x in S(A) }
Some Theory: Pairwise Independence
Universal hash functions (pairwise independent):
H is a finite collection (family) of hash functions mapping U -> {0, ..., m-1}.
H is universal if, for h picked uniformly at random from H, and for all x1, x2 in U with x1 != x2,
  Pr(h(x1) = h(x2)) <= 1/m
The class of hash functions
  h_ab(x) = ((a x + b) mod p) mod m
is universal, where p >= m is a prime, a in {1, ..., p-1}, and b in {0, ..., p-1}.
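The h_ab family is easy to instantiate. A sketch with p = 2^31 - 1 (a prime >= m) plus a quick empirical check that two fixed distinct keys collide with frequency on the order of 1/m; the particular constants and keys are illustrative:

```python
import random

P = 2_147_483_647          # the Mersenne prime 2^31 - 1, so p >= m holds
M = 1024

def random_hash(rng):
    """Draw h_ab(x) = ((a*x + b) mod p) mod m, a in {1..p-1}, b in {0..p-1}."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % M

rng = random.Random(0)     # fixed seed so the experiment is reproducible
trials, collisions = 20000, 0
for _ in range(trials):
    h = random_hash(rng)
    if h(17) == h(42):     # fixed keys x1 != x2
        collisions += 1
print(collisions / trials)   # should be about 1/M, i.e. roughly 0.001
```

The universality guarantee is only an upper bound of 1/m on the collision probability, so the measured frequency can come out somewhat below 1/M.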
Some Theory: Min-Wise Independence
Min-wise independent permutations:
S_n is a finite collection (family) of permutations mapping {1, ..., n} -> {1, ..., n}.
S_n is min-wise independent if, for pi picked uniformly at random from S_n, and for all X subset of {1, ..., n} and all x in X,
  Pr(min{pi(X)} = pi(x)) = 1/|X|
It is actually hard to find a compact collection of hash functions that is min-wise independent, but we can use an approximation.

Back to Similarity
If pi is drawn uniformly from a min-wise independent family S_n, then:
  Pr(min{pi(S(A))} = min{pi(S(B))}) = r(A, B)
This suggests we could just keep one minimum value as our "sketch", but our confidence would be low. What we want, for a sketch of size k, is either:
- use k different pi's, or
- keep the k minimum values under a single pi.
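The "use k pi's" option can be sketched with k random linear hash functions standing in for min-wise independent permutations (only an approximation, as the slide notes). The shingle sets below are hypothetical integers, on the assumption that real shingles would first be fingerprinted to integers:

```python
import random

P = 2_147_483_647   # prime modulus for the linear hash functions

def make_fns(k, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]

def sketch(shingle_set, fns):
    # One minimum hashed value per function: the "use k pi's" sketch.
    return [min((a * x + b) % P for x in shingle_set) for a, b in fns]

def est_resemblance(s1, s2):
    # Each coordinate matches with probability r(A, B) under min-wise
    # independence, so the match fraction estimates the resemblance.
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

fns = make_fns(500)
A = set(range(0, 80))     # hypothetical fingerprinted shingle sets:
B = set(range(20, 100))   # |A ∩ B| = 60, |A ∪ B| = 100, so r(A, B) = 0.6
print(est_resemblance(sketch(A, fns), sketch(B, fns)))   # close to 0.6
```

Comparing two sketches takes time linear in k, independent of document size, which is exactly the constraint the syntactic clustering slides asked for.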
$ Onscreen Showcarnegie mellon universityr 2 &Times New RomanComic Sans MSSymbolcmsy10Default DesignMicrosoft Equation 3.0$15853:Algorithms in the Real WorldIndexing and Searching OutlineLink AnalysisThe Link GraphThe Link MatrixThe Link MatrixGoogle: Page Rank (1)Google: Page Rank (2)Side Topic: Markov ChainsSide Topic: Markov ChainsSide Topic: Markov ChainsSide Topic: Markov ChainsGoogle: Page Rank (3)Google: Page Rank (4)Hubs and Authorities (1)Hubs and Authorities (2)Hubs and Authorities (3)Hubs and Authorities (4)Hubs and Authorities (5)ExtensionsIndexing and Searching Outline!Syntacting Clustering of the WebSyntactic ClusteringSyntactic ClusteringwShinglingResemblance and Containment$Calculating Resemblance/ContainmentUsing Min Valued ShinglesHash"Some Theory: Pairwise independent!Some Theory: Minwise independentBack to similarityFonts UsedDesign TemplateEmbedded OLE Servers
Slide Titles $_n
Guy BlellochGuy Blelloch .0,Microsoft Equation 3.00\2 Equation Equation.30,Microsoft Equation 3.0/00DTimes New Roman$ `H!0`hz0DComic Sans MSn$ `H!0`hz0B DSymbolans MSn$ `H!0`hz00Dcmsy10ans MSn$ `H!0`hz0"a.
@n?" dd@ @@`` X% R %&)\
#sK`H;
2 4 D"##+
!$*
'_2$]H!;ʟz $ 2$4oMr~z 2$Fv
HTzNX 2$`Anx;*;1 c$@G4IdId!0Tp ppp p
p
p@pp`ppp0ppPppp<4!d!do0<4BdBdo0uʚ;!{z6ʚ;<4dddd0l*"___PPT9/01N8(?(?O=Q#15853:Algorithms in the Real WorldAIndexing and Searching III
Link Analysis
Near duplicate removal,''Indexing and Searching OutlineIntroduction: model, query types
Inverted Indices: Compression, Lexicon, Merging
Vector Models: Weights, cosine distance
Latent Semantic Indexing:
Link Analysis: PageRank (Google), HITS
Near Duplicate Removal:\', !WJ
Link AnalysisvThe goal is to rank pages.
We want to take advantage of the link structure to do this.
Two main approaches:
Static: we will use the links to calculate a ranking of the pages offline (Google)
Dynamic: we will use the links in the results of a search to dynamically determine a ranking (Clever hubs and authorities)NlWMv91The Link GraphView documents as graph nodes and the hyperlinks between documents as directed edges.
Can give weights on edges (links) based on
position in the document
weight of anchor term
# of occurrences of link
& ,J" J:2The Link MatrixAn adjacency matrix of the link graph.
Each row corresponds to the OutLinks of a vertex
Each column corresponds to the Inlinks of a vertex
Often normalize columns to sum to 1 so the dotproduct with self is 1.
VIThe Link Matrix
;3Google: Page Rank (1)Goal: to give a total order (rank) on the importance of all pages they index
Possible Method: count the number of inlinks.
Problems?LJ
<4Google: Page Rank (2)Refined Method: weigh the source by its importance.
Seems like this is a self recursive definition.
Imagine the following:
Jane Surfer, has a lot of time to waste, and randomly selects an outgoing link each time she gets to a page.
She counts how often she visits each page.:ZVXKSide Topic: Markov ChainsVA discrete time stochastic process is a sequence of random variables {X0, X1, & , Xn, & } where the 0, 1, & , n, & are discrete points in time.
A Markov chain is a discretetime stochastic process defined over a finite (or countably infinite) set of states S in terms of a matrix P of transition probabilities.
Memorylessness property: for a Markov chain
Pr[Xt+1 = j  X0 = i0, X1 = i1, & , Xt = i] = Pr[Xt+1 = j  Xt = i]
FZ %<c Q
N$ YLSide Topic: Markov ChainsLet pi(t) be the probability of being in state i at time step t.
Let p(t) = [p0(t), p1(t), & ] be the vector of probabilities at time t.
For an initial probability distribution p(0), the probabilities at time n are
p(n) = p(0) Pn
A probability distribution p is stationary if p = p P (i.e. is an eigenvector of P with eigenvalue 1)VB[,
3,XZMSide Topic: Markov ChainsA markov chain is irreducible if the underlying graph (of P) consists of a single strong component (i.e. all states are reachable from all other states.
A period of a state i is the integer t such that Pnii = 0 for all n t, 2t, 3t, &
(i.e. on
!"#$%&'()*+,./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{}~x(Root EntrydO)CO}"@PicturesCurrent UserFDSummaryInformation(PowerPoint Document(DocumentSummaryInformation8/n.30,Microsoft Equation 3.00uL1 Equation Equation.30,Microsoft Equation 3.00wT1 Equation Equation.30,Microsoft Equation 3.00\2 Equation Equation.30,Microsoft Equation 3.0/00DTimes New Roman$ `H!0`hz0DComic Sans MSn$ `H!0`hz0B DSymbolans MSn$ `H!0`hz00Dcmsy10ans MSn$ `H!0`hz0"a.
@n?" dd@ @@`` X% R %&)\
#sK`H;
2 4 D"##+
!$*
'_2$]H!;ʟz $ 2$4oMr~z 2$Fv
HTzNX 2$`Anx;*;1 c$@G4IdId!0Tp ppp p
p
p@pp`ppp0ppPppp<4!d!do0<4BdBdo0uʚ;!{z6ʚ;<4dddd0l*"___PPT9/01N8(?(?O=Q#15853:Algorithms in the Real WorldAIndexing and Searching III
Link Analysis
Near duplicate removal,''Indexing and Searching OutlineIntroduction: model, query types
Inverted Indices: Compression, Lexicon, Merging
Vector Models: Weights, cosine distance
Latent Semantic Indexing:
Link Analysis: PageRank (Google), HITS
Near Duplicate Removal:\', !WJ
Link AnalysisvThe goal is to rank pages.
We want to take advantage of the link structure to do this.
Two main approaches:
Static: we will use the links to calculate a ranking of the pages offline (Google)
Dynamic: we will use the links in the results of a search to dynamically determine a ranking (Clever hubs and authorities)NlWMv91The Link GraphView documents as graph nodes and the hyperlinks between documents as directed edges.
Can give weights on edges (links) based on
position in the document
weight of anchor term
# of occurrences of link
& ,J" J:2The Link MatrixAn adjacency matrix of the link graph.
Each row corresponds to the OutLinks of a vertex
Each column corresponds to the Inlinks of a vertex
Often normalize columns to sum to 1 so the dotproduct with self is 1.
VIThe Link Matrix
;3Google: Page Rank (1)Goal: to give a total order (rank) on the importance of all pages they index
Possible Method: count the number of inlinks.
Problems?LJ
<4Google: Page Rank (2)Refined Method: weigh the source by its importance.
Seems like this is a self recursive definition.
Imagine the following:
Jane Surfer, has a lot of time to waste, and randomly selects an outgoing link each time she gets to a page.
She counts how often she visits each page.:ZVXKSide Topic: Markov ChainsVA discrete time stochastic process is a sequence of random variable
#$%&')*+,./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvws {X0, X1, & , Xn, & } where the 0, 1, & , n, & are discrete points in time.
A Markov chain is a discretetime stochastic process defined over a finite (or countably infinite) set of states S in terms of a matrix P of transition probabilities.
Memorylessness property: for a Markov chain
Pr[Xt+1 = j  X0 = i0, X1 = i1, & , Xt = i] = Pr[Xt+1 = j  Xt = i]
FZ %<c Q
N$ YLSide Topic: Markov ChainsLet pi(t) be the probability of being in state i at time step t.
Let p(t) = [p0(t), p1(t), & ] be the vector of probabilities at time t.
For an initial probability distribution p(0), the probabilities at time n are
p(n) = p(0) Pn
A probability distribution p is stationary if p = p P (i.e. is an eigenvector of P with eigenvalue 1)VB[,
3,XZMSide Topic: Markov ChainsA markov chain is irreducible if the underlying graph (of P) consists of a single strong component (i.e. all states are reachable from all other states.
A period of a state i is the integer t such that Pnii = 0 for all n t, 2t, 3t, &
(i.e. only comes back to itself every t steps).
A state is aperiodic if it has period 1.
A Markov chain is aperiodic if all states are aperiodic.
A state is recurrent if upon entering this state the process will definitely return to the state~+L ' ) Mb\
&
c[NSide Topic: Markov ChainsVFundamental Theorem of Markov Chains:
Any irreducible, finite, and aperiodic Markov chain has the following properties:
All states are ergodic (aperiodic and recurrent)
There is a unique stationary distribution p > 0 for all states.
Let N(i,t) be the number of times the Markov chain visits state i in t steps. Then limt ! 1 N(i,t) = pi tx" %S&o
PC
:
=5Google: Page Rank (3)Remaining problem: Gives too high of a rank to the Yahoo s of the world, and too low to clusters with low connectivity into the cluster.
Solution: With some probability jump to a random page instead a random outgoing link.@xL>6Google: Page Rank (4)lWant to find p, such that Tp = p
This corresponds to finding the principal eigenvector.
Methods:
Power method: iterate pi+1 = Tpi until it settlespick p0 randomly, decomposing p0 into eigenvectors a1 e1 + a2 e2 + & we have pi = l1i a1 e1 + l2i a2 e2 + &
Multilevel method: solve on contracted graph, use as initial vector for power method on the expanded graph.
Lancoz method: diagonalize, then iterate
All methods are quite expensiveaZ6Z Z
8 \
Pb0?7Hubs and Authorities (1)Goal: To get a ranking for a particular query (instead of the whole web)
Assume: We have a search engine that can give a set of pages P that match a query0ZEM@8Hubs and Authorities (2)For a particular query we might want to find:
The authorities on the topic (where the actual information is kept)
The hub sites that point to the best information on the topic (typically some kind of link site)
Important authorities will have many pages that point to them
Important hubs will point to many pages
But: As with pagerank, just having a lot of pointers does not mean much h." .fEA9Hubs and Authorities (3)vector h is a weight for each document in D for how good a hub it is
vector a is a weight for each document in D for how good an authority it is
T is the adjacency matrix
Now repeat:
h = TTa
a = Th
until it settles.$,B:Hubs and Authorities (4)What is this process doing?
ai+1 = (TTT) ai
hi+1 = (TTT) hi
Power method for eigenvectors of TTT and TTT
SVD(T) : first column of U and first row of V$1,mC;Hubs and Authorities (5)RProblem (Topic Drift): Hubs and Authorities will tend to drift to the general, instead of specific.
e.g. you search for Indexing Algorithms which identifies all pages with those terms in it, and the back and forward pages.Now you use the SVD to find Hubs + AuthoritiesThese will most likely be pages on Algorithms in general rather than Indexing Algorithms.
Solution: weight the documents with indexing prominently in them, more heavily.
a = TWh, h = TTWa, where W is a diagonal matrix with weights on the diagonal.
Same as finding SVD of T = TWL*ZSY^,\D<
Extensions4What about second eigenvectors?
Mix terms and links.E=Indexing and Searching OutlineIntroduction: model, query types
Inverted Indices: Compression, Lexicon, Merging
Vector Models: Weights, cosine distance
Latent Semantic Indexing:
Link Analysis: PageRank (Google), HITS
Near Duplicate Removal:\', !F> Syntacting Clustering of the WebProblem: Many pages on the web are almost the same but not exactly the same.
Mirror sites with local identifier and/or local links.
Spam sites that try to improve their pagerank
Plagiarized sites
Revisions of the same document
Program documentation converted to HTML
Legal documents
These near duplicates cause web indexing sites a huge headache.@M@F@G?Syntactic ClusteringIGoal: find pages for which either
Most of the content is the same
One is contained in the other (possibly with changes
Constraints:
Need to do it with only a small amount of information per document: a sketch
Comparison of documents should take time linear in the size of the sketch
Should be reasonably robust against adversary
"U"
UFxH@Syntactic Clustering
Any ideas?IAwShingling~View each shingle as a value.
Sw(A) : the set of wshingles for document A.
As shorthand, I will often drop the w subscript.$ ^^JBResemblance and ContainmentResemblance:
Containment:MC#Calculating Resemblance/ContainmentMethod 1:
Keep all shingles and calculate union and intersection of the shingles directly.
Problem: can take Dw space per document (larger than original document)
Other ideas?&
ODUsing Min Valued ShinglesHow about viewing each shingle as a number and selecting the 100 minimum valued shingles?
e.g. append the ASCII codes
Possible problem?6ZZQEHash}How about a random hash?
i.e. for each shingle x take h(x), then pick the minimum k values.
Sk(A) = mink{h(x) : x 2 S(A)}H~a
`RF!Some Theory: Pairwise independent
Universal Hash functions: (Pairwise independent)
H : a finite collection (family) of hash functions mapping U ! {0...m1}
H is universal if,
for h 2 H picked uniformly at random,
and for all x1, x2 2 U, x1 x2
Pr(h(x1) = h(x2)) 1/m
The class of hash functions
hab(x) = ((a x + b) mod p) mod m
is universal (p m is a prime, a = {1& p1}, b = {0& p1})1^H!<= ,*, ZSG Some Theory: Minwise independent
Minwise independent permutations:
Sn : a finite collection (family) of permutations mapping {1& n} ! {1& n}
H is minwise independent if,
for p 2 Sn picked uniformly at random,
and for X {1& n}, and all x 2 X
Pr(min{p(X)} = p(x)) 1/X
It is actually hard to find a compact collection of hash functions that is minwise independent, but we can use an approximation.h#gJ ?)ZJ.UHBack to similarityIf p 2 Sn and Sn is minwise independent then:
This suggests we could just keep one minimum value as our sketch , but our confidence would be low.
What we want for a sketch of size k is either
use k p s,
or keep the k minimum values for one p
3P+/8NP
<$(
<r
< S
0
r
< S,
H
<0h ? 3frNYL*\d(0
2=
rKz Equation Equation.30,Microsoft Equation 3.00uL1 Equation Equation.30,Microsoft Equation 3.00wT1 Equation Equation.3
!"#$%&'()*+,.H0123456789:;<=>?@ABCDEGOh+'0hp
,4%15499: Algorithms and Applicationsros
Guy Blellochrit
Guy Blellochrit134Microsoft PowerPointnd @P@ 뫉( @/@p O}Gj
g K& &&#TNPP02OMi
&
TNPP &&TNPP
 !&G&j}w@
%}ww0 &Gy& @Times New Roman}ww0 .
2
15 . . 2
5. .2
853%
.&y& .2
=Page . . 2
f1 .0 @BComic Sans MS
0}ww0 3.
2
<15. 3. 2
o5. 3.72
853:Algorithms in the Real World$
%
2
.3=t3 3v<QY0 3t@BComic Sans MS
}ww0 t.K2
SIndexing and Searching III (well actually II)
. t. 2
5. t.2
Link Analysis
. t. 2
5. t.(2
Near duplicate removal
."System
15-853: Algorithms in the Real World
Indexing and Searching III (well actually II)
Link Analysis
Near duplicate removal

Indexing and Searching Outline
Introduction: model, query types
Inverted Indices: Compression, Lexicon, Merging
Vector Models: Weights, cosine distance
Latent Semantic Indexing:
Link Analysis: PageRank (Google), HITS
Near Duplicate Removal

Link Analysis
The goal is to rank pages.
We want to take advantage of the link structure to do this.
Two main approaches:
Static: we will use the links to calculate a ranking of the pages offline (Google)
Dynamic: we will use the links in the results of a search to dynamically determine a ranking (Clever: hubs and authorities)

The Link Graph
View documents as graph nodes and the hyperlinks between documents as directed edges.
Can give weights on edges (links) based on
position in the document
weight of anchor term
# of occurrences of link
…

The Link Matrix
An adjacency matrix of the link graph.
Each row corresponds to the OutLinks of a vertex
Each column corresponds to the InLinks of a vertex
Often normalize columns to sum to 1 so the dot product with self is 1.

The Link Matrix
[figure omitted]
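The link matrix can be sketched in code. The 4-page graph below is purely illustrative; columns are normalized to sum to 1, the column-stochastic form used for PageRank in the following slides:

```python
# Hypothetical 4-page graph: edges[j] lists the pages that page j links to.
edges = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = 4

# Link matrix T with T[i][j] = 1/outdegree(j) if j links to i,
# so every column sums to 1.
T = [[0.0] * n for _ in range(n)]
for j, outs in edges.items():
    for i in outs:
        T[i][j] = 1.0 / len(outs)

for j in range(n):
    print(f"column {j} sum: {sum(T[i][j] for i in range(n)):.1f}")
```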

Google: Page Rank (1)
Goal: to give a total order (rank) on the importance of all pages they index
Possible Method: count the number of inlinks.
Problems?

Google: Page Rank (2)
Refined Method: weigh the source by its importance.
Seems like this is a self-recursive definition.
Imagine the following:
Jane Surfer has a lot of time to waste and randomly selects an outgoing link each time she gets to a page.
She counts how often she visits each page.

Side Topic: Markov Chains
A discrete-time stochastic process is a sequence of random variables {X0, X1, …, Xn, …} where 0, 1, …, n, … are discrete points in time.
A Markov chain is a discrete-time stochastic process defined over a finite (or countably infinite) set of states S in terms of a matrix P of transition probabilities.
Memorylessness property: for a Markov chain,
Pr[Xt+1 = j | X0 = i0, X1 = i1, …, Xt = i] = Pr[Xt+1 = j | Xt = i]

Side Topic: Markov Chains
Let pi(t) be the probability of being in state i at time step t.
Let p(t) = [p0(t), p1(t), …] be the vector of probabilities at time t.
For an initial probability distribution p(0), the probabilities at time n are
p(n) = p(0) P^n
A probability distribution p is stationary if p = p P (i.e. p is a left eigenvector of P with eigenvalue 1).
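A quick numeric check of p(n) = p(0) P^n on a hypothetical 2-state chain (the matrix and starting point are made up for illustration):

```python
# Hypothetical 2-state transition matrix P (rows sum to 1).
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(p, P):
    """One step of the chain: p' = p P (p is a row vector)."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

p = [1.0, 0.0]            # start in state 0
for _ in range(100):
    p = step(p, P)

# p has converged to the stationary distribution, which satisfies
# p = p P; for this P it is (5/6, 1/6).
print([round(x, 4) for x in p])   # [0.8333, 0.1667]
```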

Side Topic: Markov Chains
A Markov chain is irreducible if the underlying graph (of P) consists of a single strong component (i.e. all states are reachable from all other states).
The period of a state i is the largest integer t such that P^n_ii = 0 for all n not in {t, 2t, 3t, …}
(i.e. only comes back to itself every t steps).
A state is aperiodic if it has period 1.
A Markov chain is aperiodic if all states are aperiodic.
A state is recurrent if upon entering this state the process will definitely return to the state.

Side Topic: Markov Chains
Fundamental Theorem of Markov Chains:
Any irreducible, finite, and aperiodic Markov chain has the following properties:
All states are ergodic (aperiodic and recurrent)
There is a unique stationary distribution p, with pi > 0 for all states i.
Let N(i,t) be the number of times the Markov chain visits state i in t steps. Then lim(t→∞) N(i,t)/t = pi.

Google: Page Rank (3)
Remaining problem: gives too high a rank to the Yahoos of the world, and too low a rank to clusters with low connectivity into the cluster.
Solution: with some probability, jump to a random page instead of following a random outgoing link.

Google: Page Rank (4)
Want to find p such that Tp = p.
This corresponds to finding the principal eigenvector.
Methods:
Power method: iterate p(i+1) = T p(i) until it settles. Pick p(0) randomly; decomposing p(0) into eigenvectors as a1 e1 + a2 e2 + …, we have p(i) = λ1^i a1 e1 + λ2^i a2 e2 + …
Multilevel method: solve on contracted graph, use as initial vector for power method on the expanded graph.
Lanczos method: tridiagonalize, then iterate
All methods are quite expensive.
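The power method with the random-jump fix can be sketched as follows. The tiny graph and the jump probability 0.15 are illustrative; the matrix T is applied implicitly via the edge lists:

```python
# Hypothetical graph: edges[j] lists the pages that page j links to.
edges = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = 4
d = 0.85   # follow a link with prob. d, jump to a random page with prob. 1-d

p = [1.0 / n] * n
for _ in range(100):
    new = [(1 - d) / n] * n          # mass from random jumps
    for j, outs in edges.items():
        for i in outs:
            new[i] += d * p[j] / len(outs)
    p = new

print([round(x, 3) for x in p])      # the ranks sum to 1
```

Page 3 has no in-links, so its rank settles at exactly the jump mass (1−d)/n.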

Hubs and Authorities (1)
Goal: to get a ranking for a particular query (instead of the whole web).
Assume: we have a search engine that can give a set of pages P that match a query.

Hubs and Authorities (2)
For a particular query we might want to find:
The authorities on the topic (where the actual information is kept)
The hub sites that point to the best information on the topic (typically some kind of link site)
Important authorities will have many pages that point to them
Important hubs will point to many pages
But: as with PageRank, just having a lot of pointers does not mean much.

Hubs and Authorities (3)
Vector h is a weight for each document in D for how good a hub it is.
Vector a is a weight for each document in D for how good an authority it is.
T is the adjacency matrix
Now repeat:
h = T^T a
a = T h
until it settles.

Hubs and Authorities (4)
What is this process doing?
a(i+1) = (T T^T) a(i)
h(i+1) = (T^T T) h(i)
Power method for eigenvectors of T T^T and T^T T
SVD(T): first column of U and first row of V.

Hubs and Authorities (5)
Problem (topic drift): hubs and authorities will tend to drift to the general topic instead of the specific one.
E.g. you search for "Indexing Algorithms", which identifies all pages with those terms in them, plus the pages linking to and from them. Now you use the SVD to find hubs and authorities. These will most likely be pages on algorithms in general rather than on indexing algorithms.
Solution: weight more heavily the documents with "indexing" prominently in them.
a = TWh, h = T^T W a, where W is a diagonal matrix with weights on the diagonal.
Same as finding the SVD of T' = TW.
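The hub/authority iteration of slides (3) and (4) can be sketched as below. The 3-page adjacency matrix is made up, and the vectors are renormalized each round so the iteration settles; this is the standard power-method form of HITS:

```python
# Hypothetical adjacency matrix: T[i][j] = 1 if page i links to page j.
T = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
n = 3

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

def transpose(M):
    return [list(row) for row in zip(*M)]

def normalize(v):
    s = sum(x * x for x in v) ** 0.5
    return [x / s for x in v]

a = [1.0] * n
h = [1.0] * n
for _ in range(50):
    a = normalize(matvec(transpose(T), h))   # authority: pointed at by good hubs
    h = normalize(matvec(T, a))              # hub: points at good authorities

print("authorities:", [round(x, 3) for x in a])
print("hubs:       ", [round(x, 3) for x in h])
```

Here page 2 is linked to by pages 0 and 1, so it ends up with the largest authority weight, while page 0 (which links to both 1 and 2) ends up as the strongest hub.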

Extensions
What about second eigenvectors?
Mix terms and links.

Indexing and Searching Outline
Introduction: model, query types
Inverted Indices: Compression, Lexicon, Merging
Vector Models: Weights, cosine distance
Latent Semantic Indexing:
Link Analysis: PageRank (Google), HITS
Near Duplicate Removal

Syntactic Clustering of the Web
Problem: many pages on the web are almost the same but not exactly the same.
Mirror sites with local identifier and/or local links.
Spam sites that try to improve their PageRank
Plagiarized sites
Revisions of the same document
Program documentation converted to HTML
Legal documents
These near duplicates cause web indexing sites a huge headache.

Syntactic Clustering
Goal: find pages for which either
Most of the content is the same
One is contained in the other (possibly with changes)
Constraints:
Need to do it with only a small amount of information per document: a sketch
Comparison of documents should take time linear in the size of the sketch
Should be reasonably robust against an adversary

Syntactic Clustering
Any ideas?

Shingling
A w-shingle is a contiguous subsequence of w words (tokens) of a document; view each shingle as a value.
Sw(A): the set of w-shingles for document A.
As shorthand, I will often drop the w subscript.

Resemblance and Containment
Resemblance: r(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
Containment: c(A,B) = |S(A) ∩ S(B)| / |S(A)|

Calculating Resemblance/Containment
Method 1:
Keep all shingles and calculate union and intersection of the shingles directly.
Problem: can take |D|·w space per document (larger than the original document)
Other ideas?
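Method 1, with word w-grams as shingles, can be sketched directly; the two example strings are illustrative:

```python
def shingles(text, w=3):
    """The set S_w(text) of word w-grams."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(A, B):
    return len(A & B) / len(A | B)

def containment(A, B):
    """Fraction of A's shingles that also occur in B."""
    return len(A & B) / len(A)

SA = shingles("a rose is a rose is a rose")
SB = shingles("a rose is a flower which is a rose")
print(round(resemblance(SA, SB), 3))   # 3 shared shingles out of 7 total: 0.429
print(containment(SA, SB))             # every shingle of A occurs in B: 1.0
```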

Using Min-Valued Shingles
How about viewing each shingle as a number and selecting the 100 minimum-valued shingles?
e.g. append the ASCII codes
Possible problem?

Hash
How about a random hash?
i.e. for each shingle x take h(x), then pick the minimum k values.
Sk(A) = min_k{h(x) : x ∈ S(A)}
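A sketch of the min-k sketch S_k(A); the hash is drawn from the universal family h_ab(x) = ((a·x + b) mod p) mod m described on the next slide, with illustrative parameters:

```python
import random

P = 2_147_483_647        # a prime, chosen >= M
M = 1 << 30

def make_hash(rng):
    """Draw h_ab(x) = ((a*x + b) mod P) mod M from the universal family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % M

def sketch(shingle_values, h, k=100):
    """S_k(A): the k smallest hash values of A's shingles."""
    return sorted(h(x) for x in shingle_values)[:k]

rng = random.Random(0)
h = make_hash(rng)
A = {17, 42, 99, 123, 4096}      # shingle values, purely illustrative
print(sketch(A, h, k=3))         # three smallest hashes, in increasing order
```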

Some Theory: Pairwise Independent
Universal Hash functions: (Pairwise independent)
H: a finite collection (family) of hash functions mapping U → {0, …, m−1}
H is universal if,
for h ∈ H picked uniformly at random,
and for all x1, x2 ∈ U with x1 ≠ x2,
Pr(h(x1) = h(x2)) ≤ 1/m
The class of hash functions
h_ab(x) = ((a·x + b) mod p) mod m
is universal (p ≥ m is a prime, a ∈ {1, …, p−1}, b ∈ {0, …, p−1})

Some Theory: Min-wise Independent
Minwise independent permutations:
Sn: a finite collection (family) of permutations mapping {1, …, n} → {1, …, n}
Sn is min-wise independent if,
for π ∈ Sn picked uniformly at random,
and for X ⊆ {1, …, n} and all x ∈ X,
Pr(min{π(X)} = π(x)) = 1/|X|
It is actually hard to find a compact collection of hash functions that is min-wise independent, but we can use an approximation.

Back to Similarity
If π ∈ Sn and Sn is min-wise independent, then Pr(min{π(S(A))} = min{π(S(B))}) = r(A,B).
This suggests we could just keep one minimum value as our "sketch", but our confidence would be low.
What we want for a sketch of size k is either
use k different π's, or
keep the k minimum values for one π.
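The min-wise property can be checked empirically: for a random permutation, the two minima coincide with probability r(A,B). The sets and seed below are illustrative, and truly random permutations of a small universe stand in for a min-wise independent family:

```python
import random

U = list(range(20))
SA = set(range(0, 12))
SB = set(range(4, 16))
r = len(SA & SB) / len(SA | SB)       # = 8/16 = 0.5

rng = random.Random(1)
trials = 20000
hits = 0
for _ in range(trials):
    pi = U[:]
    rng.shuffle(pi)                    # pi[x] is the image of x under the permutation
    if min(pi[x] for x in SA) == min(pi[x] for x in SB):
        hits += 1

print(r, hits / trials)                # the two numbers should be close
```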