Basic Model
Applications:
Web, mail and dictionary searches
Law and patent searches
Information filtering (e.g., NYT articles)
Goal: Speed, Space, Accuracy, Dynamic Updates

How big is an Index?
Sep 2003, self-proclaimed sizes (gg = google, atw = alltheweb, ink = inktomi, tma = teoma)
Source: Search Engine Watch

Sizes over time

Precision and Recall
Typically a tradeoff between the two.

Precision and Recall
Does the black or the blue circle have higher precision?

Main Approaches
Full Text Searching
e.g. grep, agrep (used by many mailers)
Inverted File Indices
good for short queries
used by most search engines
Signature Files
good for longer queries with many terms
Vector Space Models
good for better accuracy
used in clustering, SVD, ...

Queries
Types of Queries on Multiple terms
boolean (and, or, not, andnot)
proximity (adj, within <n>)
keyword sets
in relation to other documents
And within each term
prefix matches
wildcards
edit distance bounds

Techniques used Across Methods
Case folding
London → london
Stemming
compress = compression = compressed
(several off-the-shelf English language stemmers are freely available)
Stop words
to, the, it, be, or, ...
how about "to be or not to be"?
Thesaurus
fast → rapid
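
The normalization steps above can be sketched in a few lines of Python. The stop list and the suffix-stripping rules are illustrative assumptions, not a real stemmer (an off-the-shelf one such as Porter's would be used in practice); the point is only to show where each step sits before terms reach the index.

```python
import re

# Hypothetical stop list, taken from the examples on the slide.
STOP_WORDS = {"to", "the", "it", "be", "or", "not"}

def stem(word):
    """Very crude suffix stripping (an assumption, not a real stemmer)."""
    for suffix in ("ion", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def normalize(text, drop_stop_words=True):
    """Case-fold, tokenize, optionally drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return [stem(t) for t in tokens]
```

With this sketch, `normalize("To be or not to be")` returns an empty list, which is exactly the stop-word problem the slide hints at; passing `drop_stop_words=False` keeps all six tokens.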

Other Methods
Document Ranking:
Returning an ordered ranking of the results
A priori ranking of documents (e.g. Google)
Ranking based on closeness to query
Ranking based on relevance feedback
Clustering and Dimensionality Reduction
Return results grouped into clusters
Return results even if the query terms do not appear in them, as long as they are clustered with documents that do
Document Preprocessing
Removing near duplicates
Detecting spam

Indexing and Searching Outline
Introduction: model, query types
Inverted File Indices:
Index compression
The lexicon
Merging terms (unions and intersections)
Vector Models:
Latent Semantic Indexing:
Link Analysis: PageRank (Google), HITS
Duplicate Removal:

Documents as Bipartite Graph
Called an Inverted File index
Can be stored using adjacency lists, also called
posting lists (or files)
inverted file entry
Example size of TREC database (Text REtrieval Conference)
538K terms
742K documents
333,856K edges
For the web, multiply by 10K

Documents as Bipartite Graph
Implementation Issues:
1. Space for posting lists
these take almost all the space
2. Access to lexicon
B-trees, tries, hashing
prefix and wildcard queries
3. Merging posting lists
multiple term queries
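
As a rough illustration of the bipartite term/document view, the sketch below builds posting lists (adjacency lists from terms to documents) for a tiny made-up collection. The documents, the whitespace tokenizer, and the function name are assumptions for the example only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted posting list of document ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)        # one edge: term -- document
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical three-document collection.
docs = {
    1: "the aardvark ate",
    2: "the elephant ate",
    3: "aardvark and elephant",
}
index = build_inverted_index(docs)
num_edges = sum(len(p) for p in index.values())   # edges of the bipartite graph
```

Here the lexicon is the set of keys of `index`, and `num_edges` plays the role of the "edges" count quoted for TREC: it is the total length of all posting lists, which is why the lists dominate the space.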

1. Space for Posting Lists
Posting lists can be as large as the document data
saving space and the time to access the space is critical for performance
We can compress the lists,
but, we need to uncompress on the fly.
Difference encoding:
Let's say the term "elephant" appears in documents:
[3, 5, 20, 21, 23, 76, 77, 78]
then the difference code is
[3, 2, 15, 1, 2, 53, 1, 1]
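
The difference encoding above is easy to check in code. This is a minimal sketch (function names are made up for the example) that converts a sorted posting list to gaps and back.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Replace each document id by its distance from the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_gaps(gaps):
    """Recover the document ids with a running sum over the gaps."""
    return list(accumulate(gaps))

elephant = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = to_gaps(elephant)
```

The gaps are small positive integers, which is what makes the variable-length codes on the following slide pay off; decoding is a single pass, so lists can be uncompressed on the fly.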

Some Codes
Gamma code:
if most significant bit of n is in location k, then
gamma(n) = 0^k n[k..0]
2 log(n) - 1 bits
Delta code:
gamma(k) n[k..0]
2 log(log(n)) + log(n) - 1 bits
Frequency coded:
based on actual probabilities of each distance
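
A small sketch of the gamma code as described above, using bit strings for readability (a real index would pack bits). With the most significant bit at location k (counting from 0), the code is k zeros followed by the k + 1 bits of n, so 2k + 1 bits in total. The function names are assumptions for the example.

```python
def gamma_encode(n):
    """Gamma code of a positive integer, returned as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma code is defined for n >= 1")
    binary = bin(n)[2:]                    # the bits n[k..0], k = len(binary) - 1
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        k = 0
        while bits[i] == "0":              # count the k leading zeros
            k += 1
            i += 1
        values.append(int(bits[i:i + k + 1], 2))
        i += k + 1
    return values

gaps = [3, 2, 15, 1, 2, 53, 1, 1]          # difference code of the elephant list
encoded = "".join(gamma_encode(g) for g in gaps)
```

For the elephant gaps this gives 30 bits for eight postings, and `gamma_decode(encoded)` recovers the gaps without any separators, since the run of zeros announces the length of each code.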

Global vs. Local Probabilities
Global:
Count the number of occurrences of each distance
Use Huffman or arithmetic code
Local:
generate counts for each list
elephant: [3, 2, 1, 2, 53, 1, 1]
Problem: counts take too much space
Solution: batching
group into buckets by ⌊log(length)⌋

Performance
Bits per edge based on the TREC document collection
Total size = 333M * .66 bytes, roughly 220 Mbytes

2. Accessing the Lexicon
We all know how to store a dictionary, BUT...
it is best if the lexicon fits in memory: can we avoid storing all characters of all words?
what about prefix or wildcard queries?
Some possible data structures
Front Coding
Tries
Perfect Hashing
B-trees

Front Coding
For large lexicons can save 75% of space
But what about random access?

Prefix and Wildcard Queries
Prefix queries
Handled by all access methods except hashing
Wildcard queries
n-gram
rotated lexicon
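
Prefix queries work on any ordered access method because all terms sharing a prefix sit next to each other in sorted order, which is the property hashing destroys. A minimal sketch over a sorted in-memory lexicon (the term list and function name are made up for the example):

```python
from bisect import bisect_left

def prefix_range(sorted_lexicon, prefix):
    """Return all terms in a sorted lexicon that start with `prefix`."""
    lo = bisect_left(sorted_lexicon, prefix)       # first term >= prefix
    hi = lo
    while hi < len(sorted_lexicon) and sorted_lexicon[hi].startswith(prefix):
        hi += 1                                    # matches are contiguous
    return sorted_lexicon[lo:hi]

# Hypothetical lexicon.
lexicon = sorted(["jezebel", "jet", "jewel", "job", "zebra"])
```

The same idea carries over to B-trees and tries: locate the first matching term, then scan forward while the prefix still matches.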

n-gram
Consider every block of n characters in a term:
e.g. 2-gram of jezebel → $j, je, ez, ze, eb, be, el, l$

Rotated Lexicon
Consider every rotation of a term:
e.g. jezebel → $jezebel, l$jezebe, el$jezeb, bel$jeze, ...
Now store lexicon of all rotations
Given a query find longest contiguous block (with rotation) and search for it:
e.g. j*el → search for el$j in lexicon
Note that each lexicon entry corresponds to a single term
e.g. ebel$jez can only mean jezebel

3. Merging Posting Lists
Let's say queries are expressions over:
and, or, andnot
View the list of documents for a term as a set:
Then
e1 and e2 → S1 intersect S2
e1 or e2 → S1 union S2
e1 andnot e2 → S1 diff S2
Some notes:
the sets are ordered in the posting lists
S1 and S2 can differ in size substantially
might be good to keep intermediate results
persistence is important
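
Because posting lists are kept in sorted order, the three set operations can each be done with a single linear pass over both lists. The sketch below is this plain O(n + m) baseline, not the split/join method developed on the following slides; the second posting list is made up for the example.

```python
def intersect(a, b):
    """e1 and e2: ids present in both sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """e1 or e2: ids present in either sorted list, without duplicates."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def difference(a, b):
    """e1 andnot e2: ids in a that are not in b."""
    out, j = [], 0
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out

elephant = [3, 5, 20, 21, 23, 76, 77, 78]
other = [5, 21, 40, 78]                    # hypothetical second term
```

When the two lists differ a lot in size, this pass still touches every element of the longer list, which motivates the bounds in terms of m log((n + m)/m) discussed next.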

Union, Intersection, and Merging
Given two sets of length n and m how long does it take for intersection, union and set difference?
Assume elements are taken from a total order (<)
Very similar to merging two sets A and B, how long does this take?
What is a lower bound?

Union, Intersection, and Merging
Lower Bound:
There are n elements of A and n + m positions in the output where they could go
Number of possible interleavings: (n + m) choose n
Assuming comparison based model, the decision tree has that many leaves
Max depth is at least log of number of leaves
Assuming m < n: the depth is Ω(m log((n + m)/m))
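
The equations on this slide did not survive extraction, so the following is a reconstruction under the stated assumptions (comparison model, sets of sizes n and m with m ≤ n), using the standard bounds on binomial coefficients rather than the slide's own wording:

```latex
\[
\#\text{interleavings} = \binom{n+m}{n} = \binom{n+m}{m},
\qquad
\left(\frac{n+m}{m}\right)^{m} \;\le\; \binom{n+m}{m} \;\le\; \left(\frac{e\,(n+m)}{m}\right)^{m}
\]
\[
\Longrightarrow\quad
\log_2 \binom{n+m}{m} \;=\; \Theta\!\left(m \log \frac{n+m}{m}\right)
\qquad (m \le n).
\]
```

The upper estimate adds only an m log e term, which is absorbed because (n + m)/m ≥ 2 when m ≤ n. The decision tree therefore has depth Ω(m log((n + m)/m)), matching the upper bound quoted on the next slide.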

Merging: Upper bounds
Brown and Tarjan show an O(m log((n + m)/m)) upper bound using 2-3 trees with cross links and parent pointers. Very messy.
We will take a different approach, and base an implementation on two operations: split and join
Split and Join can then be implemented on many different kinds of trees.

Split and Join
Split(S, v): Split S into two sets S< = {s ∈ S | s < v} and S> = {s ∈ S | s > v}. Also return a flag which is true if v ∈ S.
Split({7,9,15,18,22}, 18) → {7,9,15}, {22}, True
Join(S<, S>): Assuming ∀ k< ∈ S<, k> ∈ S>: k< < k>, returns S< ∪ S>
Join({7,9,11}, {14,22}) → {7,9,11,14,22}
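
To make the two operations concrete, here is a naive stand-in that represents a set as a sorted Python list. It follows the definitions above but copies list slices, so each call costs O(|S|); it illustrates the interface, not the efficient tree-based version.

```python
from bisect import bisect_left

def split(s, v):
    """Return (elements < v, elements > v, whether v is in s) for sorted list s."""
    i = bisect_left(s, v)
    found = i < len(s) and s[i] == v
    greater = s[i + 1:] if found else s[i:]
    return s[:i], greater, found

def join(less, greater):
    """Concatenate two sorted lists, assuming every key in `less` is smaller."""
    assert not less or not greater or less[-1] < greater[0]
    return less + greater
```

Calling `split([7, 9, 15, 18, 22], 18)` and `join([7, 9, 11], [14, 22])` reproduces the two examples on the slide.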

Time for Split and Join
Split(S, v) → (S<, S>), flag        Join(S<, S>) → S
Naively:
T = O(|S|)
Less Naively:
T = O(log |S|)
What we want:
T = O(log(min(|S<|, |S>|))) (can be shown)
T = O(log |S<|) (will actually suffice)

Will also use
isEmpty(S) → boolean
True if the set S is empty
first(S) → e
returns the least element of S
first({2,6,9,11,13}) → 2
{e} → S
creates a singleton set from an element
We assume these can all run in O(1) time.
An ADT with 5 operations!

Union with Split and Join
Union(S1, S2) =
  if isEmpty(S1) then return S2
  else
    (S2<, S2>, fl) = Split(S2, first(S1))
    return Join(S2<, Union(S2>, S1))
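
The recursion above translates almost line for line into Python. In this sketch `split` and `join` are naive sorted-list stand-ins (an assumption made to keep the example self-contained), so it shows the structure of the algorithm rather than its running time.

```python
from bisect import bisect_left

def split(s, v):
    # Naive stand-in: (elements < v, elements > v, v in s) for a sorted list.
    i = bisect_left(s, v)
    found = i < len(s) and s[i] == v
    return s[:i], (s[i + 1:] if found else s[i:]), found

def join(less, greater):
    # Naive stand-in: all keys in `less` are assumed smaller than those in `greater`.
    return less + greater

def union(s1, s2):
    """Union of two sorted lists, following Union(S1, S2) from the slide."""
    if not s1:                                # isEmpty(S1)
        return s2
    less, greater, _ = split(s2, s1[0])       # Split(S2, first(S1))
    return join(less, union(greater, s1))     # note the swapped arguments
```

Each recursive call peels off one block of consecutive output elements coming from the same input, so the recursion depth equals the number of such blocks; for long, heavily interleaved lists an iterative formulation would be needed in Python.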

Runtime of Union
T_union = O(Σ_i log |o_i| + Σ_i log |o_i|)
             (Splits)        (Joins)
Since the logarithm function is concave, this is maximized when blocks are as close as possible to equal size, therefore
T_union = O(Σ_{i=1..m} log ⌈n/m + 1⌉)
        = O(m log((n + m)/m))

Intersection with Split and Join
Intersect(S1, S2) =
  if isEmpty(S1) then return {}
  else
    (S2<, S2>, flag) = Split(S2, first(S1))
    if flag then
      return Join({first(S1)}, Intersect(S2>, S1))
    else
      return Intersect(S2>, S1)
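
The same pattern works for intersection. As before, this is a sketch in which `split` is a naive sorted-list stand-in (an assumption for the example), so only the control flow mirrors the slide.

```python
from bisect import bisect_left

def split(s, v):
    # Naive stand-in: (elements < v, elements > v, v in s) for a sorted list.
    i = bisect_left(s, v)
    found = i < len(s) and s[i] == v
    return s[:i], (s[i + 1:] if found else s[i:]), found

def intersect(s1, s2):
    """Intersection of two sorted lists, following Intersect(S1, S2) from the slide."""
    if not s1:                                   # isEmpty(S1)
        return []
    _, greater, found = split(s2, s1[0])         # Split(S2, first(S1))
    rest = intersect(greater, s1)                # roles of the two sets swap
    return [s1[0]] + rest if found else rest     # Join({first(S1)}, ...)
```

Everything in S2 below first(S1) is discarded by a single split, which is where the savings come from when one list is much longer than the other.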

Efficient Split and Join
Recall that we want: T = O(log |S<|)
How do we implement this efficiently?

Treaps
Every key is given a random priority.
keys are stored in-order
priorities are stored in heap-order
e.g. (key, priority): (1,23), (4,40), (5,11), (9,35), (12,30)

Left Spinal Treap
Time to split = length of path from Start to split location l
We will show that this is O(log L) in the expected case, where L is the number of keys between Start and l (inclusive). 10 in the example.
Time to Join is the same

Analysis
[Lemma: equation not recoverable from the slide]

Analysis Continued
Proof:
i is an ancestor of j iff i has a greater priority than all elements between i and j, inclusive.
there are |i - j| + 1 such elements, each with equal probability of having the highest priority.

Analysis Continued
Can similarly show that: [equation not recoverable from the slide]

And back to Posting Lists
We showed how to take Unions and Intersections, but Treaps are not very space efficient.
Idea: if priorities are in the range [0..1) then any node with priority < 1 - α is stored compressed.
α represents fraction of uncompressed nodes.

Case Study: AltaVista
How AltaVista implements indexing and searching, or at least how they did in 1998.
Based on a talk by A. Broder and M. Henzinger from AltaVista. Henzinger is now at Google, Broder is at IBM.
The index (posting lists)
The lexicon
Query merging (or, and, andnot queries)
The size of their whole index is about 30% the size of the original documents it encodes.6N[N[ti

AltaVista: the index
All documents are concatenated together into one sequence of terms (stop words removed).
This allows proximity queries
Other companies do not do this, but do proximity tests in a postprocessing phase
Tokens separate documents
Posting lists contain pointers to individual terms in the single concatenated document.
Difference encoded
Use Front Coding for the Lexicon

AltaVista: the lexicon
The Lexicon is front coded.
Allows prefix queries, but requires prefix to be at least 3 characters (otherwise too many hits)

AltaVista: query merging
Support expressions on terms involving: AND, OR, ANDNOT and NEAR
Implement posting list with an abstract data type called an Index Stream Reader (ISR).
Supports the following operations:
loc() : current location in ISR
next() : advance to the next location
seek(k) : advance to first location past k

AltaVista: query merging (cont.)
Queries are decomposed into the following operations:
Create : term → ISR             ISR for the term
Or     : ISR * ISR → ISR        Union
And    : ISR * ISR → ISR        Intersection
AndNot : ISR * ISR → ISR        Set difference
Near   : ISR * ISR → ISR        Intersection, almost
Note that all can be implemented with our Treap Data structure.
I believe (from private conversations) that they use a two level hierarchy that approximates the advantages of balanced trees (e.g. treaps).
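
A rough sketch of what an Index Stream Reader and the And operation might look like, using only the three operations named on the slide. This is not AltaVista's implementation: the class name, the end-of-stream sentinel, the list-backed storage, and the reading of "past k" as "at or after k" are all assumptions made for illustration.

```python
class ListISR:
    """Index Stream Reader over a sorted list of locations (illustrative only)."""
    END = float("inf")                 # hypothetical sentinel: stream exhausted

    def __init__(self, locations):
        self._locs = locations
        self._i = 0

    def loc(self):
        """Current location, or END when the stream is exhausted."""
        return self._locs[self._i] if self._i < len(self._locs) else self.END

    def next(self):
        """Advance to the next location."""
        self._i += 1
        return self.loc()

    def seek(self, k):
        """Advance to the first location >= k (assumed meaning of "past k")."""
        while self.loc() < k:
            self._i += 1
        return self.loc()

def and_locations(a, b):
    """Collect the locations present in both streams, leapfrogging with seek."""
    out = []
    while a.loc() != ListISR.END and b.loc() != ListISR.END:
        if a.loc() == b.loc():
            out.append(a.loc())
            a.next()
            b.next()
        elif a.loc() < b.loc():
            a.seek(b.loc())
        else:
            b.seek(a.loc())
    return out
```

Here `seek` is a linear scan; the benefit of the interface is that a tree-backed reader (for example one built on split) could implement the same call by skipping whole runs of locations, without the query code changing.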

[Figure (Basic Model): Document Collection, Index, Query, Document List]

Precision and Recall
Precision: (number retrieved that are relevant) / (total number retrieved)
Recall: (number relevant that are retrieved) / (total number relevant)
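
The two ratios can be computed directly from the set of retrieved documents and the set of relevant ones. A minimal sketch (the function name and the convention of returning 0.0 for an empty denominator are assumptions):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)               # retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving documents 1 to 4 when documents 2, 4, 6, 8 and 10 are relevant gives precision 0.5 and recall 0.4; retrieving more documents can only raise recall but usually lowers precision, which is the tradeoff mentioned on the slide.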

Analysis Continued
Therefore the expected path length and runtime for split and join is O(log l).
Similar techniques can be used for other properties of Treaps.
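
For reference, split and join on a treap can be sketched as below. This version works from the root, so its expected cost is O(log |S|) rather than the O(log |S<|) of the left spinal variant analyzed above; the class and function names are assumptions, and priorities are drawn uniformly from [0, 1).

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.priority = random.random()    # random priority in [0, 1)
        self.left = None
        self.right = None

def split(root, v):
    """Return (treap of keys < v, treap of keys > v, whether v was present)."""
    if root is None:
        return None, None, False
    if v < root.key:
        less, greater, found = split(root.left, v)
        root.left = greater                # root keeps its place above `greater`
        return less, root, found
    if v > root.key:
        less, greater, found = split(root.right, v)
        root.right = less
        return root, greater, found
    return root.left, root.right, True     # v itself is dropped

def join(less, greater):
    """Join two treaps; every key in `less` must be below every key in `greater`."""
    if less is None:
        return greater
    if greater is None:
        return less
    if less.priority > greater.priority:   # higher priority becomes the root
        less.right = join(less.right, greater)
        return less
    greater.left = join(less, greater.left)
    return greater

def insert(root, key):
    """Insert a key by splitting around it and joining the three pieces."""
    less, greater, _ = split(root, key)
    return join(join(less, Node(key)), greater)

def keys(root):
    """In-order traversal, i.e. the keys in sorted order."""
    return [] if root is None else keys(root.left) + [root.key] + keys(root.right)
```

Both operations only walk one root-to-leaf path and relink pointers along it, which is why their cost is governed by the expected path length discussed in the analysis.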

15-853: Algorithms in the Real World
Guy Blelloch, Carnegie Mellon University
.3=ticrosoft Equation 3.008պEquation Equation.30,Microsoft Equation 3.009ֺEquation Equation.30,Microsoft Equation 3.00:Equation Equation.30,Microsoft Equation 3.00;غEquation Equation.30,Microsoft Equation 3.00wٺEquation Equation.30,Microsoft Equation 3.00=ںEquation Equation.30,Microsoft Equation 3.00>ۺEquation Equation.30,Microsoft Equation 3.00?ܺEquation Equation.30,Microsoft Equation 3.00@ݺEquation Equation.30,Microsoft Equation 3.00AEquation Equation.30,Microsoft Equation 3.00M&ߺEquation Equation.30,Microsoft Equation 3.00y'Equation Equation.30,Microsoft Equation 3.0/0DTimes New Romanpƺ`H!0`hz0DComic Sans MSnpƺ`H!0`hz0B DSymbolans MSnpƺ`H!0`hz00Dcmsy10ans MSnpƺ`H!0`hz0"@DCourier NewSnpƺ`H!0`hz01PDMT ExtraewSnpƺ`H!0`hz0`DTimesraewSnpƺ`H!0`hz0X
a.
@n?" dd@ @@``
1 R
&)\
#sK`H;
2 4 D
"$%
+
+,./011/$2$4n3KLƠ̈́2$=pR
nƵm2$\.Nn}l2$ϔsS9ȷ'r"2$g$muk[LfTh2$vpKil=o2$/J;Z;T4 2$zzF7\⑃'2$sz~$,d
2$gy,[ZC./vC2$76 lW4&2$^epa=6, Basic ModelApplications:
Web, mail and dictionary searches
Law and patent searches
Information filtering (e.g., NYT articles)
Goal: Speed, Space, Accuracy, Dynamic Updatese.eHow big is an Index?wSep 2003, self proclaimed sizes (gg = google, atw = alltheweb, ink = inktomi, tma = teoma)
Source: Search Engine Watchxx! 3/Sizes over time
*'Precision and Recall%Typically a tradeoff between the two.+(Precision and Recall8Does the black or the blue circle have higher precision?Main Approaches
Full Text Searching
e.g. grep, agrep (used by many mailers)
Inverted File Indices
good for short queries
used by most search engines
Signature Files
good for longer queries with many terms
Vector Space Models
good for better accuracy
used in clustering, SVD, & (3(5(3(5,QueriesTypes of Queries on Multiple terms
boolean (and, or, not, andnot)
proximity (adj, within <n>)
keyword sets
in relation to other documents
And within each term
prefix matches
wildcards
edit distance boundsN%g.%g.>%
!"#$%&'()*+,./012345789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{}~
1 !"#$%&'()*+w023456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnorstuvxyz{}Root EntrydO)0}qPictures\jCurrent UserJDSummaryInformation(<PowerPoint Document(6s2DocumentSummaryInformation8,Microsoft Equation 3.009ֺEquation Equation.30,Microsoft Equation 3.00:Equation Equation.30,Microsoft Equation 3.00;غEquation Equation.30,Microsoft Equation 3.00wٺEquation Equation.30,Microsoft Equation 3.00=ںEquation Equation.30,Microsoft Equation 3.00>ۺEquation Equation.30,Microsoft Equation 3.00?ܺEquation Equation.30,Microsoft Equation 3.00@ݺEquation Equation.30,Microsoft Equation 3.00AEquation Equation.30,Microsoft Equation 3.00M&ߺEquation Equation.30,Microsoft Equation 3.00y'Equation Equation.30,Microsoft Equation 3.0/0DTimes New Romanpƺ`H!0`hz0DComic Sans MSnpƺ`H!0`hz0B DSymbolans MSnpƺ`H!0`hz00Dcmsy10ans MSnpƺ`H!0`hz0"@DCourier NewSnpƺ`H!0`hz01PDMT ExtraewSnpƺ`H!0`hz0`DTimesraewSnpƺ`H!0`hz0X
a.
@n?" dd@ @@``
1 R
&)\
#sK`H;
2 4 D
"$%
+
+,./011/$2$4n3KLƠ̈́2$=pR
nƵm2$\.Nn}l2$ϔsS9ȷ'r"2$g$muk[LfTh2$vpKil=o2$/J;Z;T4 2$zzF7\⑃'2$sz~$,d
2$gy,[ZC./vC2$76 lW4&2$^epa=6, Basic ModelApplications:
Web, mail and dictionary searches
Law and patent searches
Information filtering (e.g., NYT articles)
Goal: Speed, Space, Accuracy, Dynamic Updatese.eHow big is an Index?wSep 2003, self proclaimed sizes (gg = google, atw = alltheweb, ink = inktomi, tma = teoma)
Source: Search Engine Watchxx! 3/Sizes over time
*'Precision and Recall%Typically a tradeoff between the two.+(Precision and Recall8Does the black or the blue circle have higher precision?Main Approaches
Full Text Searching
e.g. grep, agrep (used by many mailers)
Inverted File Indices
good for short queries
used by most search engines
Signature Files
good for longer queries with many terms
Vector Space Models
good for better accuracy
used in clustering, SVD, & (3(5(3(5,QueriesTypes of Queries on Multiple terms
boolean (and, or, not, andnot)
proximity (adj, within <n>)
keyword sets
in relation to other documents
And within each term
prefix matches
wildcards
edit distance boundsN%g.%g.>%
}Technique used Across MethodsCase folding
London > london
Stemming
compress = compression = compressed
(several offtheshelf English Language stemmers are freely available)
Stop words
to, the, it, be, or, &
how about to be or not to be
Thesaurus
fast > rapid
ZZ ZkZZ8ZZZZ
k8 ,U s)&
Other Methods@Document Ranking:
Returning an ordered ranking of the results
A priori ranking of documents (e.g. Google)
Ranking based on closeness to query
Ranking based on relevance feedback
Clustering and Dimensionality Reduction
Return results grouped into clusters
Return results even if query terms does not appear but are clustered with documents that do
Document Preprocessing
Removing near duplicates
Detecting spam>ZxZ*ZZZ(Z,x*(,b4"!Indexing and Searching OutlineIntroduction: model, query types
Inverted File Indices:
Index compression
The lexicon
Merging terms (unions and intersections)
Vector Models:
Latent Semantic Indexing:
Link Analysis: PageRank (Google), HITS
Duplicate Removal:

Documents as Bipartite Graph
Called an Inverted File index
Can be stored using adjacency lists, also called
posting lists (or files)
inverted file entry
Example size of TREC database (Text REtrieval Conference)
538K terms
742K documents
333,856K edges
For the web, multiply by 10K

Documents as Bipartite Graph
Implementation Issues:
1. Space for posting lists
these take almost all the space
2. Access to lexicon
B-trees, tries, hashing
prefix and wildcard queries
3. Merging posting lists
multiple term queries
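
The three pieces named above (posting lists, a lexicon that maps terms to lists, and merging for multi-term queries) can be seen together in a toy inverted index. This is only a sketch under simple assumptions: documents are short strings, tokens are whitespace-separated, the lexicon is a Python dict, and the names `build_index` and `query_and` are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to its posting list: the sorted ids of documents containing it."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() so that a term is recorded once per document;
        # ids are appended in increasing order, so each list stays sorted.
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
    return dict(postings)


def query_and(index, terms):
    """Documents containing every term: the intersection of the posting lists."""
    lists = [set(index.get(t.lower(), [])) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []


if __name__ == "__main__":
    docs = ["The elephant walks", "An elephant never forgets", "The mouse forgets"]
    index = build_index(docs)
    print(index["elephant"])                          # [0, 1]
    print(query_and(index, ["elephant", "forgets"]))  # [1]
```

A dict stands in for the lexicon here; it gives exact-match lookups only, which is why prefix and wildcard queries call for the other structures listed above.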

1. Space for Posting Lists
Posting lists can be as large as the document data
saving space and the time to access the space is critical for performance
We can compress the lists,
but we need to uncompress on the fly.
Difference encoding:
Let's say the term elephant appears in documents:
[3, 5, 20, 21, 23, 76, 77, 78]
then the difference code is
[3, 2, 15, 1, 2, 53, 1, 1]
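
The difference code above can be reproduced with a few lines. This is a small sketch, with `to_gaps` and `from_gaps` as hypothetical helper names; it assumes document ids are positive and strictly increasing.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Replace each document id by its distance from the previous id."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Undo the difference code with a running sum."""
    return list(accumulate(gaps))


if __name__ == "__main__":
    elephant = [3, 5, 20, 21, 23, 76, 77, 78]
    gaps = to_gaps(elephant)
    print(gaps)             # [3, 2, 15, 1, 2, 53, 1, 1]
    print(from_gaps(gaps))  # [3, 5, 20, 21, 23, 76, 77, 78]
```

Decoding is a single left-to-right pass, which is what makes uncompressing on the fly practical; the gaps are small numbers, so a variable-length code can store most of them in a few bits.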

Some Codes
Gamma code:
if most significant bit of n is in location k, then
gamma(n) = 0^k n[k..0]  (k zeros followed by bits k..0 of n)
2 log(n) - 1 bits
Delta code:
gamma(k) n[k..0]
2 log(log(n)) + log(n) - 1 bits
Frequency coded:
based on actual probabilities of each distance
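
A sketch of the gamma and delta codes as bit strings. It follows the standard Elias definitions, which is an assumption about what the slide intends: gamma writes k zeros and then the k+1 bits of n (2*floor(log2 n) + 1 bits in total), and delta writes gamma of the bit length followed by n without its leading 1. The function names are hypothetical.

```python
def gamma(n):
    """Elias gamma code of an integer n >= 1, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma code is defined for n >= 1")
    bits = bin(n)[2:]                      # binary digits of n, MSB first
    return "0" * (len(bits) - 1) + bits    # k zeros, then n[k..0]


def delta(n):
    """Elias delta code: gamma(bit length of n), then n without its leading 1."""
    if n < 1:
        raise ValueError("delta code is defined for n >= 1")
    bits = bin(n)[2:]
    return gamma(len(bits)) + bits[1:]


def gamma_decode_all(stream):
    """Decode a concatenation of gamma codes (assumes well-formed input)."""
    out, i = [], 0
    while i < len(stream):
        k = 0
        while stream[i] == "0":            # count the leading zeros
            k += 1
            i += 1
        out.append(int(stream[i:i + k + 1], 2))
        i += k + 1
    return out


if __name__ == "__main__":
    gaps = [3, 2, 15, 1, 2, 53, 1, 1]
    stream = "".join(gamma(g) for g in gaps)
    print(gamma(5), delta(5))              # 00101 01101
    print(len(stream))                     # 30
    print(gamma_decode_all(stream) == gaps)
```

For the elephant list, the eight gaps fit in 30 bits with the gamma code, under 4 bits per posting.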

Global vs. Local Probabilities
Global:
Count # of occurrences of each distance
Use Huffman or arithmetic code
Local:
generate counts for each list
elephant: [3, 2, 15, 1, 2, 53, 1, 1]
Problem: counts take too much space
Solution: batching
group into buckets by $\lfloor \log(\text{length}) \rfloor$

Performance
Bits per edge based on the TREC document collection
Total size = 333M * .66 bytes = 222 Mbytes

2. Accessing the Lexicon
We all know how to store a dictionary, BUT...
it is best if the lexicon fits in memory: can we avoid storing all characters of all words?
what about prefix or wildcard queries?
Some possible data structures
Front Coding
Tries
Perfect Hashing
B-trees

Front Coding
For large lexicons can save 75% of space
But what about random access?

Prefix and Wildcard Queries
Prefix queries
Handled by all access methods except hashing
Wildcard queries
n-gram
rotated lexicon

n-gram
Consider every block of n characters in a term:
e.g. 2-gram of jezebel -> $j, je, ez, ze, eb, be, el, l$

Rotated Lexicon
Consider every rotation of a term:
e.g. jezebel -> $jezebel, l$jezebe, el$jezeb, bel$jeze
Now store lexicon of all rotations
Given a query find longest contiguous block (with rotation) and search for it:
e.g. j*el -> search for el$j in lexicon
Note that each lexicon entry corresponds to a single term
e.g. ebel$jez can only mean jezebel

3. Merging Posting Lists
Let's say queries are expressions over:
and, or, andnot
View the list of documents for a term as a set:
Then
e1 and e2 -> S1 intersect S2
e1 or e2 -> S1 union S2
e1 andnot e2 -> S1 diff S2
Some notes:
the sets are ordered in the posting lists
S1 and S2 can differ in size substantially
might be good to keep intermediate results
persistence is important

Union, Intersection, and Merging
Given two sets of length n and m, how long does it take for intersection, union and set difference?
Assume elements are taken from a total order (<)
Very similar to merging two sets A and B; how long does this take?
What is a lower bound?

Union, Intersection, and Merging
Lower Bound:
There are n elements of A and n + m positions in the output where they could belong
Number of possible interleavings: $\binom{n+m}{n}$
Assuming a comparison-based model, the decision tree has that many leaves
Max depth is at least the log of the number of leaves
Assuming m < n: $\Omega(m \log((n+m)/m))$
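
The equations on this slide did not survive extraction. The derivation below is a reconstruction of the standard counting argument, not a copy of the original formulas; it assumes all n + m elements are distinct.

```latex
% A's n elements pick n of the n+m output positions; B fills the rest.
\#\,\text{interleavings} \;=\; \binom{n+m}{n} \;=\; \binom{n+m}{m}

% Each factor (n+m-i)/(m-i) is at least (n+m)/m for 0 \le i < m.
\binom{n+m}{m} \;=\; \prod_{i=0}^{m-1} \frac{n+m-i}{m-i}
  \;\ge\; \left(\frac{n+m}{m}\right)^{m}

% A binary decision tree with that many leaves has depth at least log_2 of it.
\text{comparisons} \;\ge\; \log_2 \binom{n+m}{m}
  \;\ge\; m \log_2 \frac{n+m}{m}
```

For m much smaller than n this is far below the n + m cost of a linear merge, which is why the sizes of the two posting lists matter.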

Merging: Upper bounds
Brown and Tarjan show an O(m log((n + m)/m)) upper bound using 2-3 trees with cross links and parent pointers. Very messy.
We will take a different approach, and base an implementation on two operations: split and join
Split and Join can then be implemented on many different kinds of trees. We will describe an implementation based on treaps.

Split and Join
Split(S,v) : Split S into two sets $S_< = \{s \in S \mid s < v\}$ and $S_> = \{s \in S \mid s > v\}$. Also return a flag which is true if $v \in S$.
Split({7,9,15,18,22}, 18) -> {7,9,15},{22},True
Join(S<, S>) : Assuming $\forall k_< \in S_<, k_> \in S_> : k_< < k_>$, returns $S_< \cup S_>$
Join({7,9,11},{14,22}) -> {7,9,11,14,22}
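
To make the two operations concrete, here is a sketch that represents a set as a sorted Python list. That representation is an assumption chosen for readability: slicing copies elements, so it does not have the logarithmic cost that a tree-based implementation is after. The names `split` and `join` mirror the slide.

```python
from bisect import bisect_left


def split(s, v):
    """Return (elements < v, elements > v, flag), flag being True if v is in s."""
    i = bisect_left(s, v)                  # first position with s[i] >= v
    found = i < len(s) and s[i] == v
    hi = s[i + 1:] if found else s[i:]     # skip v itself when present
    return s[:i], hi, found


def join(lo, hi):
    """Concatenate two sets, assuming every key of lo is below every key of hi."""
    assert not lo or not hi or lo[-1] < hi[0], "join precondition violated"
    return lo + hi


if __name__ == "__main__":
    print(split([7, 9, 15, 18, 22], 18))   # ([7, 9, 15], [22], True)
    print(join([7, 9, 11], [14, 22]))      # [7, 9, 11, 14, 22]
```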

Time for Split and Join
Split(S,v) -> (S<, S>), flag        Join(S<, S>) -> S
Naively:
T = O(|S|)
Less Naively:
T = O(log |S|)
What we want:
T = O(log(min(|S<|, |S>|)))  (can be shown)
T = O(log |S<|)  (will actually suffice)

Will also use
isEmpty(S) -> boolean
True if the set S is empty
first(S) -> e
returns the least element of S
first({2,6,9,11,13}) -> 2
{e} -> S
creates a singleton set from an element
We assume these can all run in O(1) time.
An ADT with 5 operations!

Union with Split and Join
Union(S1, S2) =
  if isEmpty(S1) then return S2
  else
    (S2<, S2>, fl) = Split(S2, first(S1))
    return Join(S2<, Union(S2>, S1))
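
The recursion above can be run directly if sets are modeled as sorted Python lists. This is a sketch under that assumption, so `split` and `join` here cost linear time rather than the logarithmic time the analysis relies on, and the recursion depth grows with the number of alternations between the two inputs.

```python
from bisect import bisect_left


def split(s, v):
    """Return (elements < v, elements > v, flag); list-based stand-in for Split."""
    i = bisect_left(s, v)
    found = i < len(s) and s[i] == v
    hi = s[i + 1:] if found else s[i:]
    return s[:i], hi, found


def join(lo, hi):
    """List-based stand-in for Join; every key of lo must precede every key of hi."""
    assert not lo or not hi or lo[-1] < hi[0]
    return lo + hi


def union(s1, s2):
    """Union(S1, S2) as on the slide: the two arguments swap roles on each call."""
    if not s1:                                  # isEmpty(S1)
        return s2
    lo, hi, _flag = split(s2, s1[0])            # Split(S2, first(S1))
    return join(lo, union(hi, s1))              # Join(S2<, Union(S2>, S1))


if __name__ == "__main__":
    print(union([2, 6, 9], [1, 6, 7, 10]))      # [1, 2, 6, 7, 9, 10]
```

Note how a key present in both sets (6 here) is dropped from S2 by the split and survives only through S1, so no separate duplicate check is needed.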

Runtime of Union
$T_{union} = O\left(\sum_i \log |o_i| + \sum_i \log |o_i|\right)$
(the first sum is the cost of the Splits, the second the cost of the Joins)
Since the logarithm function is concave, this is maximized when blocks are as close as possible to equal size, therefore
$T_{union} = O\left(\sum_{i=1}^{m} \log \lceil n/m + 1 \rceil\right) = O(m \log((n+m)/m))$

Intersection with Split and Join
Intersect(S1, S2) =
  if isEmpty(S1) then return {}
  else
    (S2<, S2>, flag) = Split(S2, first(S1))
    if flag then
      return Join({first(S1)}, Intersect(S2>, S1))
    else
      return Intersect(S2>, S1)
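
A runnable version of the intersection recursion, again assuming sets are sorted Python lists (a simplification: `split` copies elements here, so this shows the control flow rather than the target running time).

```python
from bisect import bisect_left


def split(s, v):
    """Return (elements < v, elements > v, flag); list-based stand-in for Split."""
    i = bisect_left(s, v)
    found = i < len(s) and s[i] == v
    hi = s[i + 1:] if found else s[i:]
    return s[:i], hi, found


def intersect(s1, s2):
    """Intersect(S1, S2) as on the slide; S2< is discarded at every step."""
    if not s1:                                   # isEmpty(S1)
        return []
    first = s1[0]
    _lo, hi, flag = split(s2, first)             # Split(S2, first(S1))
    rest = intersect(hi, s1)                     # Intersect(S2>, S1)
    # Join({first(S1)}, rest): every key in rest is greater than first.
    return [first] + rest if flag else rest


if __name__ == "__main__":
    print(intersect([2, 6, 9], [1, 6, 7, 9]))    # [6, 9]
```

Each call throws away the part of S2 below first(S1) without looking at its elements one by one, which is where the savings over a linear merge come from when one list is much shorter than the other.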

Efficient Split and Join
Recall that we want: T = O(log |S<|)
How do we implement this efficiently?

Treaps
Every key is given a random priority.
keys are stored in-order
priorities are stored in heap-order
e.g. (key,priority) : (1,23), (4,40), (5,11), (9,35), (12,30)

Left Spinal Treap
Time to split = length of path from Start to split location l
We will show that this is O(log L) in the expected case, where L is the number of keys between Start and l (inclusive). 10 in the example.
Time to Join is the same

Analysis

Analysis Continued
Proof:
i is an ancestor of j iff i has a greater priority than all elements between i and j, inclusive.
there are |i - j| + 1 such elements, each with equal probability of having the highest priority.

Analysis Continued
Can similarly show that:

And back to Posting Lists
We showed how to take Unions and Intersections, but Treaps are not very space efficient.
Idea: if priorities are in the range [0..1) then any node with priority < 1 - a is stored compressed.
a represents the fraction of uncompressed nodes.

Case Study: AltaVista
How AltaVista implements indexing and searching, or at least how they did in 1998.
Based on a talk by A. Broder and M. Henzinger from AltaVista. Henzinger is now at Google, Broder is at IBM.
The index (posting lists)
The lexicon
Query merging (or, and, andnot queries)
The size of their whole index is about 30% the size of the original documents it encodes.

AltaVista: the index
All documents are concatenated together into one sequence of terms (stop words removed).
This allows proximity queries
Other companies do not do this, but do proximity tests in a postprocessing phase
Tokens separate documents
Posting lists contain pointers to individual terms in the single concatenated document.
Difference encoded
Use Front Coding for the Lexicon

AltaVista: the lexicon
The Lexicon is front coded.
Allows prefix queries, but requires prefix to be at least 3 characters (otherwise too many hits)

AltaVista: query merging
Support expressions on terms involving: AND, OR, ANDNOT and NEAR
Implement posting list with an abstract data type called an Index Stream Reader (ISR).
Supports the following operations:
loc() : current location in ISR
next() : advance to the next location
seek(k) : advance to first location past k

AltaVista: query merging (cont.)
Queries are decomposed into the following operations:
Create : term -> ISR          ISR for the term
Or : ISR * ISR -> ISR         Union
And : ISR * ISR -> ISR        Intersection
AndNot : ISR * ISR -> ISR     Set difference
Near : ISR * ISR -> ISR       Intersection, almost
Note that all can be implemented with our Treap data structure.
I believe (from private conversations) that they use a two-level hierarchy that approximates the advantages of balanced trees (e.g. treaps).
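
A toy illustration of the ISR interface (loc, next, seek) and of how And can be built from it. This is not AltaVista's implementation: the class `ListISR`, the `END` sentinel, and the helper `and_locations` are hypothetical, locations are treated as plain integers, and "past k" is read as "at or after k", which is an assumption.

```python
from bisect import bisect_left

END = float("inf")   # value of loc() once a reader is exhausted (assumed convention)


class ListISR:
    """Index Stream Reader over a sorted, in-memory list of locations."""

    def __init__(self, locations):
        self._locs = list(locations)
        self._i = 0

    def loc(self):
        """Current location, or END when the stream is exhausted."""
        return self._locs[self._i] if self._i < len(self._locs) else END

    def next(self):
        """Advance to the next location."""
        self._i = min(self._i + 1, len(self._locs))
        return self.loc()

    def seek(self, k):
        """Advance (never backwards) to the first location >= k."""
        self._i = bisect_left(self._locs, k, self._i)
        return self.loc()


def and_locations(a, b):
    """Locations present in both streams, leapfrogging with seek."""
    out = []
    while a.loc() != END and b.loc() != END:
        if a.loc() == b.loc():
            out.append(a.loc())
            a.next()
            b.next()
        elif a.loc() < b.loc():
            a.seek(b.loc())
        else:
            b.seek(a.loc())
    return out


if __name__ == "__main__":
    print(and_locations(ListISR([3, 5, 20, 21]), ListISR([5, 21, 30])))  # [5, 21]
```

The point of seek is that the reader that is behind can jump ahead without visiting every location in between, which matters when one posting list is much longer than the other.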

15-499: Algorithms and Applications
Guy Blelloch
Algorithms in the Real World
Indexing and Searching I (how google and the likes work)