Lab 6: A Simple Flickr Crawler
Due Date: Saturday April 4th, 2009 by midnight
Objective:
Once you are done with this lab, you should be familiar with the following:
- using Perl's primitive data structures such as arrays and hashes
- using Perl subroutines
- executing external commands from a Perl script
- reading from and writing to files
- parsing text using Perl's regular expression facility (regex)
- creating a customized web page with organized images
Assignment:
Part 1
In this part of the assignment, you will write a Perl script that downloads a set of images from the
www.flickr.com website based on some keyword(s).
You are required to use wget to download the images after retrieving the HTML.
We will use LWP (Library for WWW in Perl) to retrieve the HTML.
A starter file is available at Lab6_starter.pl.
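For a sense of how the two pieces fit together, here is a minimal sketch of an LWP fetch followed by a
wget download. The search URL format and the image URL are assumptions for illustration, not the exact
strings your script will end up using:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Fetch one page of search results. The query-string format is an
    # assumption -- check what your browser shows for a real search.
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get("http://www.flickr.com/search/?q=beach");
    die "fetch failed: ", $res->status_line, "\n" unless $res->is_success;
    my $html = $res->content;

    # Once an image URL has been extracted from $html, hand it to wget;
    # -O names the output file (see `man wget`). The URL below is a
    # placeholder, not a real Flickr image.
    my $url = "http://www.foo.com/bar.jpg";
    system("wget", "-q", "-O", "beach1.jpg", $url) == 0
        or warn "wget failed for $url\n";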
The Perl script accepts a list of keywords as run-time parameters, plus three optional flags: -n, -t,
and -f (see the usage summary below). The script passes the keywords to www.flickr.com as part of
your search query. The -n num_of_images flag tells the script to find and download num_of_images
images for each keyword, while -t target_dir tells it to save the images to target_dir (the -f flag is
covered under Part 2 and the lab requirements). Each downloaded image should be identified by a keyword
(derived from the run-time parameter) and an index number. For example, if the command
perl lab6.pl -n 10 -t test beach is given, the program will get 10 images for the keyword "beach"
from Flickr and save them into a folder called test, named beach1.jpg, beach2.jpg, etc.
usage: perl lab6.pl [options] keyword1 keyword2 ...
options:
-n num_of_images the number of images to download for each keyword
(the default is 10)
-t target_dir where to put the downloaded images.
(the default is the current directory)
-f file_name the name of the htm file to organize images.
(the default name is images.htm)
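One possible way to wire up these options is with Getopt::Long (note the use statements in the starter
code). This sketch uses illustrative variable names and the documented defaults:

    use strict;
    use warnings;
    use Getopt::Long;

    # Defaults match the usage summary above; whatever is left in @ARGV
    # after GetOptions runs is the list of keywords.
    my $num_of_images = 10;
    my $target_dir    = ".";
    my $file_name;                     # stays undef unless -f is given

    GetOptions(
        "n=i" => \$num_of_images,
        "t=s" => \$target_dir,
        "f=s" => \$file_name,
    ) or die "usage: perl lab6.pl [options] keyword1 keyword2 ...\n";

    my @keywords = @ARGV;
    die "no keywords given\n" unless @keywords;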
IMPORTANT: image URLs should be saved in an array to be used in Part 2.
Part 2
In this part of the assignment, you will create a simple web page that organizes the images (four images
per row). Keep each image at height=200 width=200, as shown in sample.htm.
Each image should link to the actual image on Flickr. You do not need to know a
lot about HTML, but your program must create the .htm file dynamically based on the keyword(s). Take a look
at the source code of sample.htm (given in the folder) to see how to create this file.
The sample output shows 8 images from Flickr based on the search word
"beach", and each image is directly linked to the image on Flickr. You DO NOT need to download the images
for this part; instead, the image URLs are encoded directly in the HTML file.
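As a rough sketch of what a makeHTML subroutine might emit (assuming the Part 1 URLs are already in an
array; sample.htm remains the authority on the exact layout):

    # Sketch of makeHTML, assuming @urls holds the image URLs from Part 1.
    sub makeHTML {
        my ($file_name, @urls) = @_;
        open(my $fh, ">", $file_name) or die "cannot open $file_name: $!\n";
        print $fh "<html><body>\n<table>\n";
        for my $i (0 .. $#urls) {
            print $fh "<tr>\n" if $i % 4 == 0;        # start a new row of four
            print $fh qq(<td><a href="$urls[$i]">),
                      qq(<img src="$urls[$i]" height="200" width="200">),
                      qq(</a></td>\n);
            print $fh "</tr>\n" if $i % 4 == 3 or $i == $#urls;
        }
        print $fh "</table>\n</body></html>\n";
        close $fh;
    }

    # e.g. makeHTML("images.htm", @image_urls);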
Extra Credit
If you are interested in some extra credit (up to 10 points), you can do the following. First, enhance the Part 2
HTML page by including titles under the pictures. Second, extend your Perl script
to find the 3 most popular tags for the images downloaded. This requires you to
extract the tags from each image's HTML page, rank them, and list the top 3 tags present in the
images. If there are ties, list all of them. For example:
The top 3 tags for the images downloaded
elephant
monkey
tiger
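A hash makes this ranking straightforward. The sketch below assumes the scraped tags have been pushed
into an @all_tags array; the sample data is only a stand-in:

    use strict;
    use warnings;

    # Stand-in data; in your script @all_tags comes from the regex pass
    # over each image's HTML page.
    my @all_tags = qw(tiger monkey tiger elephant tiger monkey elephant zebra);

    my %count;
    $count{$_}++ for @all_tags;

    # Sort tags by descending count, then keep every tag whose count ties
    # one of the counts held by the first three tags (this handles ties).
    my @sorted = sort { $count{$b} <=> $count{$a} } keys %count;
    my %keep = map { $count{$_} => 1 } @sorted[0 .. (@sorted < 3 ? $#sorted : 2)];

    print "The top 3 tags for the images downloaded\n";
    print "$_\n" for grep { $keep{ $count{$_} } } @sorted;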
Getting Started
- In your browser of choice go to http://www.flickr.com/ and search for something. On the page with your results, what happened to the URL? What happens when you change the URL directly? What happens to the URL when you go to a second page of results?
- Use the man pages for wget to figure out how to specify the name of the output file for your image files.
- GetOptions, mkdir, and chdir are all available to you.
- You *MUST* go to the next page if the user requests more images than are available on the first page. Think about how to traverse all the result pages by encoding the page number into the URL.
- Image tags in HTML look like <img src="http://www.foo.com/bar.jpg">. You need to extract these tags to create the file (see the extraction sketch after this list).
- Are all the images on the page related to the search term? Look at the HTML and see how the page is structured to get an idea of how you can parse it (hint: class=DetailPic)
- Perl is a magical language and the Internet is your friend. Look for pre-built modules that do what
you're interested in doing before reinventing the wheel. The "use" statements in the starter code should be a big hint.
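As a concrete starting point, here is a sketch of the regex extraction. The stand-in $html and the
markup it matches are assumptions; inspect the real page source before settling on a pattern:

    use strict;
    use warnings;

    # $html would come from the LWP fetch; a tiny stand-in is used here
    # so the snippet runs on its own.
    my $html = '<img class=DetailPic src="http://www.foo.com/bar.jpg">';

    my @image_urls;
    while ($html =~ m{<img[^>]+src="(http://[^"]+\.jpg)"}g) {
        push @image_urls, $1;          # keep every URL for Part 2
    }
    print "$_\n" for @image_urls;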
Lab Requirements
- You must have at least two subroutines in your program, getImages and makeHTML (see the starter code)
- Your program must store the image URLs in an array or hash table
- You must use regex to parse and extract the URLs
- If -f file_name is given, images will NOT be downloaded into the folder; instead, links are written to the html file file_name
- If -f file_name is not given, then the images must be downloaded into the folder (a control-flow sketch follows this list)
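Putting these requirements together, the overall control flow might resemble the stubbed skeleton
below; the subroutine bodies are the actual work of the lab:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hard-wired stand-ins for the values the GetOptions sketch produces.
    my ($num_of_images, $target_dir, $file_name) = (10, ".", undef);
    my @keywords = ("beach");

    sub getImages {
        my ($keyword, $num) = @_;
        # fetch result pages, regex out up to $num image URLs (Part 1)
        return ();                     # stub
    }

    sub makeHTML {
        my ($file, @urls) = @_;
        # write the table of linked images (Part 2)
    }

    for my $keyword (@keywords) {
        my @urls = getImages($keyword, $num_of_images);
        if (defined $file_name) {
            makeHTML($file_name, @urls);   # -f given: links only
        } else {
            # -f absent: wget each URL into $target_dir as ${keyword}N.jpg
        }
    }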
Downloading Files
As usual, download files from /afs/andrew.cmu.edu/course/15/123/downloads/lab6
Handing in your Solution
Your solution should be in the form of a .pl file. Submit to /afs/andrew.cmu.edu/course/15/123/handin/lab6
Grading
Your program will be graded according to the rubric.txt given in the downloads folder
FAQ
We always try to maintain an updated FAQ.txt accessible from Bb->labs. Please read the FAQ.txt file if you
have any questions. If you cannot find the answer in FAQ.txt, please send email to any course staff, cc'ing guna@andrew
Better Mouse Trap
Do you have a creative idea about how to make this lab more interesting and practical? If so, please write the
idea(s) in a file called idea.txt and place it in the handin folder. If we like your idea, we will seriously
consider using it in future labs and, more importantly, will give you some extra credit on this lab and
an acknowledgment in future labs.
The original idea for this lab came from Hassan Rom, CS major, 2006.