Spring 2010 CS 510 Information Retrieval on the Internet

This document is stored at www.cs.pdx.edu/~maier/cs510iri

Announcements (Last update 9:21am, 3 June 2010):

·      Lecture notes 20 posted

Instructors

David Maier maier at cs dot pdx dot edu, 115-14 FAB.
Melanie Mitchell mm at cs dot pdx dot edu, 120-24 FAB.

 Note: Please put ‘cs510’ at the beginning of the subject line.

Lab Assistant

Jeremy Steinhauer jsteinha at cs dot pdx dot edu, 115-H FAB

Phone:

Mitchell:  503-725-2412

Maier: 503-725-2406

Class Meeting

Tuesday, Thursday 10:00-11:15a, 150 FAB

Office Hours

Maier: Mondays 2-3pm

Mitchell: Tuesdays and Thursdays 4-5pm

Steinhauer: Fridays 1-2pm

You are welcome to ask questions by e-mail or phone.

Guest Lecturers

TBA

Weekly Schedule

[This schedule is preliminary and subject to change]

Assignments due Tuesdays

Wk | Date | Topic | Reading (will be refined) | Slides | Homework (due Tuesdays at 10 AM)
1a | Tues, 30 Mar; DM | Introduction to Information Retrieval; Boolean Retrieval | Ch. 1 | Lecture 1 | Written Assignment 1 assigned
1b | Thurs, 1 Apr; DM, JS | Text Processing; Introduction to Software for Project | Ch. 2 | Lecture 2; Lucene | Lucene Project, Part I assigned; Document directory
2a | Tues, 6 Apr; DM | Indexing | Ch. 4; Zobel & Moffat (see readings for citation) | Lecture 3 | Written Assignment 1 due (8 points); Experimental Project, Parts A and B, assigned
2b | Thurs, 8 Apr; DM | Scoring, Weighting, VSM | Ch. 6 | Lecture 4 |
3a | Tues, 13 Apr; DM, JS | Catch up, Project info | Section 7.1 | Lecture 5 | Experimental Project Part A due (10 points)
3b | Thurs, 15 Apr; MM | Evaluation | Ch. 8 | Lecture 6 |
4a | Tues, 20 Apr; MM (DM away) | Probability Review; Text Classification | Chs. 11, 13 | Lecture 7a (Evaluation 2); Lecture 7b (Text Classification 1) | Lucene Project Part I due (10 points); Written Assignment 2 assigned; Lucene Project, Part II, assigned; Lucene Project, Part II, slides
4b | Thurs, 22 Apr; MM | Vector-Space Classification | Ch. 14 | Lecture 8a (Text Classification 2); Lecture 8b (Vector Space Classification) |
5a | Tues, 27 Apr; MM | Support Vector Machines | Ch. 15 | Lecture 9; Support Vector Machines | Written Assignment 2 due (8 points); Written Assignment 3 assigned; Data for Written Assignment 3: spam.svm.train and spam.svm.test
5b | Thurs, 29 Apr; DM (MM away) | Relevance Feedback | Ch. 9 | Lecture 10 |
6a | Tues, 4 May; DM | Information Extraction, Segmentation, Summarization | | Lecture 11 | Written Assignment 3 due (8 points); Written Assignment 4 assigned
6b | Thurs, 6 May; MM | Clustering 1 | Ch. 16 | Lecture 12 | Project I answer key
7a | Tues, 11 May; MM | Clustering 2 | Ch. 17 | Lecture 13 | Written Assignment 4 due (8 points)
7b | Thurs, 13 May; MM | Latent Semantic Indexing | Ch. 18 | Lecture 14 |
8a | Tues, 18 May; Guest: Steven Bedrick, OHSU | Image Retrieval | | Lecture 15 | Experimental Project Part B due (20 points); Written Assignment 5 assigned; Link to LSI tutorial
8b | Thurs, 20 May; DM | IR and the Web | Chs. 19, 20 | Lecture 16 |
9a | Tues, 25 May; DM | IR and the Web | | Lecture 17 | Written Assignment 5 due (8 points)
9b | Thurs, 27 May; DM | Making Information Findable | | Lecture 18 |
10a | Tues, 1 June; MM | Network Structure and Search, Part 1 | Ch. 21 | Lecture 19 | Lucene Project Part II due (20 points)
10b | Thurs, 3 June; MM | Network Structure and Search, Part 2 | | Lecture 20 |
11 | 8, 10 June | Final exam week | | |

Class E-mail

The e-mail list for this class is cs510iri@cs.pdx.edu. It will be used for announcements from the instructors. You can also send questions and answers to this mailing list. You can subscribe to the list at https://mailhost.cecs.pdx.edu/mailman/listinfo/cs510iri.

Catalog Description

The Internet has seen the most extensive application of information retrieval (IR) techniques to date. At the same time, the Internet has often stressed traditional IR methods to the breaking point. This course introduces classical IR concepts, but also discusses how they are stressed when applied in the large, distributed and dynamic setting of the Internet, and covers some of the techniques used to get around the limitations. The first half of the course will address standard IR topics: keyword-based retrieval and indexing, classification, clustering and evaluation metrics for different approaches. The second half of the course will cover several different challenges for IR on the Internet, and technologies being developed to address them, such as

 • Collection building: crawling the web, duplicate detection, sampling, digital libraries

 • Providing context: annotation, meta-data

 • Distribution: scalable architectures

 • Heterogeneity: scraping, wrapping, translation

 • Enterprise and Organizational issues: standards, interoperation

The course will also examine particular systems for searching and intermediation on the Internet. This course may be used in the Databases track of the CS MS.

Textbooks

REQUIRED:
Introduction to Information Retrieval.
By Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008, ISBN-13: 9780521865715. 

You can access the book and related materials at: http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Reading

Readings will come both from the textbook and from supplementary materials.

Assignments

There will be written assignments and project assignments. The written assignments will generally involve some manual exercise, with a short write-up due (possibly with answers to specific questions).

There will also be two projects. The first will involve formulating a hypothesis about the behavior of a search engine (such as Google), devising a way to test the hypothesis, conducting that test and analyzing your results. The second will involve collecting, indexing and searching web content using the Lucene search-engine library.

 

Students registered for CS 510 (rather than 410) will have an additional section of each assignment to complete.

 

Grading
Assignments: There are 9 assignments, worth a total of 100 points (100% of your grade). Five are written assignments worth a total of 40 points (8 points each). There are two projects, a Lucene project and an Experimental project, each consisting of two parts. The first part of each project is worth 10 points; the second part of each project is worth 20 points. Some assignments are to be done individually; others may be done individually or in groups of 2 or 3. If you work in a team, turn in one paper with the names of all team members on it. Make sure your assignments are legible. You may seek help from your partner (if you have one), the instructors, and the class mailing list, but otherwise work independently. Assignments are due on TUESDAYS at the beginning of class.

 

There will be no exams in this course.

Information

The Google Garden is here.

A second planting is here.

Policies

Students are responsible for anything that transpires during a class. If you miss a class, you should get notes from another student (not from the instructors).

Assignments are due at the beginning of the class period. 

Late homework and projects will not be accepted without prior approval from one of us. Without prior approval, a late assignment automatically loses 50% of its credit, or receives no credit (0%) if that assignment has already been discussed in class.

Requests for regrading must be submitted in writing within one week of the time the graded assignment was returned.  You must be specific in saying why you feel your answer deserves additional credit. 

Students with disabilities who are in need of academic accommodations should contact us as soon as possible to arrange needed supports. Students are also encouraged to contact the Disability Resource Center (DRC) for additional information on support services and available accommodations at 503-725-4240 or 503-725-4150.

Academic Integrity

[Excerpt from the 2004-2005 PSU Catalog, pages 29-30]
The policies of the University governing the rights, freedoms, responsibilities, and conduct of students are set forth in the Statement of Student Rights, Freedoms, and Responsibilities, as supplemented and amended by the Portland State University Student Conduct Code, which has been issued by the President under authority of the Administrative Rules of the Oregon State Board of Higher Education. The code governing academic honesty is part of the Student Conduct Code. Students may consult these documents in the Office of Student Affairs, 433 Smith Memorial Student Union or by visiting the OSA Web site.  Observance of these rules, policies, and procedures helps the University to operate in a climate of free inquiry and expression and  assists it in protecting its academic environment and educational purpose.

Academic honesty: Academic honesty is a cornerstone of any meaningful education and a reflection of each student’s maturity and integrity. The Office of Student Affairs is responsible for working with University faculty to address complaints of academic dishonesty.  The Student Conduct Code, which applies to all students, prohibits all forms of academic cheating, fraud, and dishonesty.  These acts include, but are not limited to, plagiarism, buying and selling of course assignments and research papers, performing academic assignments (including tests and examinations) for other persons, unauthorized disclosure and receipt of academic information, and other practices commonly understood to be academically dishonest.  For a copy of the Student Code of Conduct see the OSA Web site.  Allegations of academic dishonesty may be addressed by the instructor, may be referred to the Office of Student Affairs for action, or both. Allegations referred to the Office of Student Affairs are investigated following the procedures outlined in the Student Conduct Code.  Acts of academic dishonesty may result in one or more of the following sanctions: a failing grade on the exam or assignment for which the dishonesty occurred, disciplinary reprimand, disciplinary probation, loss of privileges, required community service, suspension from the University for a period of up to two years, and/or dismissal from the University.  Questions regarding academic honesty should be directed to the Office of Student Affairs, 433 Smith Memorial Student Union.

Supplementary Readings

Online textbook by C. J. van Rijsbergen: http://www.dcs.gla.ac.uk/Keith/Preface.html

Useful IR Resources

·         Lucene:

o   Overview: http://lucene.apache.org/java/docs/index.html

o   Downloads: http://www.apache.org/dyn/closer.cgi/lucene/java/

o   API: http://lucene.apache.org/java/docs/api/

·         Nutch (open source project for writing web crawlers; sibling project to Lucene)

o   Home page: http://lucene.apache.org/nutch/index.html

o   Download: http://www.apache.org/dyn/closer.cgi/lucene/nutch/

·         Stemming:

o   Article by Martin Porter

o   Lancaster Stemming Algorithm site

·         Probabilistic model

o   A probabilistic model of information retrieval: development and status. K. Sparck Jones, S. Walker, and S.E. Robertson.

·         Controlled vocabularies/Thesauri

o   ANSI/NISO Z39.19-2005 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies

o   Medical Subject Headings (MeSH)

o   Art & Architecture Thesaurus

o   Agrovoc

·         Indexing

o   Inverted files for text search engines. J. Zobel and A. Moffat. ACM Computing Surveys, Volume 38, Issue 2 (2006).

o   The Term Vector Database: Fast access to indexing terms for Web pages. R. Stata, K. Bharat, F. Maghoul. Computer Networks 33(1-6), June 2000.

·         Sources of Advanced Readings

o   Recommended Reading for IR Research Students

o   Readings in Information Retrieval, edited by Karen Sparck Jones and Peter Willett.  Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1997.

·         Evaluation

o   Cumulated Gain-Based Evaluation of IR Techniques. K. Järvelin and J. Kekäläinen. ACM Transactions on Information Systems, Vol. 20, No. 4, October 2002, pp. 422-446.