Lab 6: A Simple Flickr Crawler
Due Date: Saturday April 4th, 2009 by midnight
Objective:
Once you are done with this lab, you should be familiar with the following:
- using Perl's primitive data structures such as arrays and hashes
- using Perl subroutines
- executing external commands from a Perl script
- reading from and writing to files
- parsing text using Perl's regular expression facility (regex)
- creating a customized web page with organized images
Assignment:
Part 1
In this part of the assignment, you will write a Perl script that downloads a set of images from the
www.flickr.com website based on some keyword(s).
You are required to use wget to download the images after retrieving the HTML.
We will use LWP (Library for WWW in Perl) to retrieve the HTML.
A starter file is available at Lab6_starter.pl.
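For a sense of how the two pieces fit together, here is a minimal sketch of an LWP fetch followed by a
wget download. The search URL format and the image URL are assumptions for illustration, not the exact
strings your script will end up using:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Fetch one page of search results. The query-string format is an
    # assumption -- check what your browser shows for a real search.
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.flickr.com/search/?q=beach");
    die "fetch failed: ", $res->status_line, "\n" unless $res->is_success;
    my $html = $res->content;

    # Once an image URL has been extracted from $html, hand it to wget;
    # -O names the output file (see `man wget`). The URL below is a
    # placeholder, not a real Flickr image.
    my $url = "http://www.foo.com/bar.jpg";
    system("wget", "-q", "-O", "beach1.jpg", $url) == 0
        or warn "wget failed for $url\n";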
The Perl script accepts a list of keywords as run-time parameters, plus three optional flags: -n, -t,
and -f (see the usage summary below). The script passes the keywords to www.flickr.com as part of
your search query. The -n num_of_images flag tells the script to find and download num_of_images
images for each keyword, while -t target_dir tells it to save the images to target_dir (the -f flag is
covered under Part 2 and the lab requirements). Each downloaded image should be identified by a keyword
(derived from the run-time parameter) and an index number. For example, if the command
perl lab6.pl -n 10 -t test beach is given, the program will get 10 images for the keyword "beach"
from Flickr and save them into a folder called test, named beach1.jpg, beach2.jpg, etc.
usage: perl lab6.pl [options] keyword1 keyword2 ...
options:
-n num_of_images the number of images to download for each keyword
(the default is 10)
-t target_dir where to put the downloaded images.
(the default is the current directory)
-f file_name the name of the htm file to organize images.
(the default name is images.htm)
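One possible way to wire up these options is with Getopt::Long (note the use statements in the starter
code). This sketch uses illustrative variable names and the documented defaults:

    use strict;
    use warnings;
    use Getopt::Long;

    # Defaults match the usage summary above; whatever is left in @ARGV
    # after GetOptions runs is the list of keywords.
    my $num_of_images = 10;
    my $target_dir    = ".";
    my $file_name;                     # stays undef unless -f is given

    GetOptions(
        "n=i" => \$num_of_images,
        "t=s" => \$target_dir,
        "f=s" => \$file_name,
    ) or die "usage: perl lab6.pl [options] keyword1 keyword2 ...\n";

    my @keywords = @ARGV;
    die "no keywords given\n" unless @keywords;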
IMPORTANT: image URLs should be saved in an array to be used in Part 2.
Part 2
In this part of the assignment, you will create a simple web page that organizes the images (four images
per row). Keep each image at height=200 width=200, as shown in sample.htm.
Each image should link to the actual image on Flickr. You do not need to know a
lot about HTML, but your program must create the .htm file dynamically based on the keyword(s). Take a look
at the source code of sample.htm (given in the folder) to see how to create this file.
The sample output shows 8 images from Flickr based on the search word
"beach", and each image is directly linked to the image on Flickr. You DO NOT need to download the images
for this part; instead, the image URLs are encoded directly in the HTML file.
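As a rough sketch of what a makeHTML subroutine might emit (assuming the Part 1 URLs are already in an
array; sample.htm remains the authority on the exact layout):

    # Sketch of makeHTML, assuming @urls holds the image URLs from Part 1.
    sub makeHTML {
        my ($file_name, @urls) = @_;
        open(my $fh, ">", $file_name) or die "cannot open $file_name: $!\n";
        print $fh "<html><body>\n<table>\n";
        for my $i (0 .. $#urls) {
            print $fh "<tr>\n" if $i % 4 == 0;        # start a new row of four
            print $fh qq(<td><a href="$urls[$i]">),
                      qq(<img src="$urls[$i]" height="200" width="200">),
                      qq(</a></td>\n);
            print $fh "</tr>\n" if $i % 4 == 3 or $i == $#urls;
        }
        print $fh "</table>\n</body></html>\n";
        close $fh;
    }

    # e.g. makeHTML("images.htm", @image_urls);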
Extra Credit
If you are interested in some extra credit (up to 10 points), you can do the following. First, enhance the Part 2
HTML page by including titles under the pictures. Second, extend your Perl script
to find the 3 most popular tags for the images downloaded. This requires you to
extract the tags from each image's HTML page, rank them, and list the top 3 tags present in the
images. If there are ties, list all of them. For example:
The top 3 tags for the images downloaded
elephant
monkey
tiger
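A hash makes this ranking straightforward. The sketch below assumes the scraped tags have been pushed
into an @all_tags array; the sample data is only a stand-in:

    use strict;
    use warnings;

    # Stand-in data; in your script @all_tags comes from the regex pass
    # over each image's HTML page.
    my @all_tags = qw(tiger monkey tiger elephant tiger monkey elephant zebra);

    my %count;
    $count{$_}++ for @all_tags;

    # Sort tags by descending count, then keep every tag whose count ties
    # one of the counts held by the first three tags (this handles ties).
    my @sorted = sort { $count{$b} <=> $count{$a} } keys %count;
    my %keep = map { $count{$_} => 1 } @sorted[0 .. (@sorted < 3 ? $#sorted : 2)];

    print "The top 3 tags for the images downloaded\n";
    print "$_\n" for grep { $keep{ $count{$_} } } @sorted;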
Getting Started
- In your browser of choice go to http://www.flickr.com/ and search for something. On the page with your results, what happened to the URL? What happens when you change the URL directly? What happens to the URL when you go to a second page of results?
- Use the man pages for wget to figure out how to specify the name of the output file for your image files.
- GetOptions, mkdir, and chdir are all available to you.
- You *MUST* go to the next page if the user requests more images than are available on the first page. Think about how to traverse all the result pages by encoding the page number into the URL.
- Image tags in HTML look like <img src="http://www.foo.com/bar.jpg">. You need to extract these tags to create the file (see the extraction sketch after this list).
- Are all the images on the page related to the search term? Look at the HTML and see how the page is structured to get an idea of how you can parse it (hint: class=DetailPic)
- Perl is a magical language and the Internet is your friend. Look for pre-built modules that do what
you're interested in doing before reinventing the wheel. The "use" statements in the starter code should be a big hint.
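As a concrete starting point, here is a sketch of the regex extraction. The stand-in $html and the
markup it matches are assumptions; inspect the real page source before settling on a pattern:

    use strict;
    use warnings;

    # $html would come from the LWP fetch; a tiny stand-in is used here
    # so the snippet runs on its own.
    my $html = '<img class=DetailPic src="http://www.foo.com/bar.jpg">';

    my @image_urls;
    while ($html =~ m{<img[^>]+src="(http://[^"]+\.jpg)"}g) {
        push @image_urls, $1;          # keep every URL for Part 2
    }
    print "$_\n" for @image_urls;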
Lab Requirements
- You must have at least two subroutines in your program, getImages and makeHTML (see the starter code)
- Your program must store the image URLs in an array or hash table
- You must use regex to parse and extract the URLs
- If -f file_name is given, images will NOT be downloaded into the folder; instead, links are written to the html file file_name
- If -f file_name is not given, then the images must be downloaded into the folder (a control-flow sketch follows this list)
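Putting these requirements together, the overall control flow might resemble the stubbed skeleton
below; the subroutine bodies are the actual work of the lab:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hard-wired stand-ins for the values the GetOptions sketch produces.
    my ($num_of_images, $target_dir, $file_name) = (10, ".", undef);
    my @keywords = ("beach");

    sub getImages {
        my ($keyword, $num) = @_;
        # fetch result pages, regex out up to $num image URLs (Part 1)
        return ();                     # stub
    }

    sub makeHTML {
        my ($file, @urls) = @_;
        # write the table of linked images (Part 2)
    }

    for my $keyword (@keywords) {
        my @urls = getImages($keyword, $num_of_images);
        if (defined $file_name) {
            makeHTML($file_name, @urls);   # -f given: links only
        } else {
            # -f absent: wget each URL into $target_dir as ${keyword}N.jpg
        }
    }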
Downloading Files
As usual, download files from /afs/andrew.cmu.edu/course/15/123/downloads/lab6
Handing in your Solution
Your solution should be in the form of a .pl file. Submit to /afs/andrew.cmu.edu/course/15/123/handin/lab6
Grading
Your program will be graded according to the rubric.txt given in the downloads folder
FAQ
We always try to maintain an updated FAQ.txt accessible from Bb->labs. Please read the FAQ.txt file if you
have any questions. If you cannot find the answer in FAQ.txt, please send email to any course staff, cc'ing guna@andrew
Better Mouse Trap
Do you have a creative idea about how to make this lab more interesting and practical? If so, please write the
idea(s) in a file called idea.txt and place it in the handin folder. If we like your idea, we will seriously
consider using it in future labs and, more importantly, will give you some extra credit on this lab and
an acknowledgment in future labs.
The original idea for this lab came from Hassan Rom, CS major, 2006.