| Course Description: | The Internet has seen the most extensive application of information retrieval (IR) techniques to date. At the same time, the Internet has often stressed traditional IR methods to the breaking point. This course introduces classical IR concepts, discusses how they are stressed when applied in the large, distributed, and dynamic setting of the Internet, and covers some of the techniques used to work around their limitations. The first half of the course addresses standard IR topics: keyword-based retrieval and indexing, classification, clustering, and evaluation metrics for the different approaches. The second half covers several challenges for IR on the Internet and the technologies being developed to address them, such as: • Collection building: crawling the web, duplicate detection, sampling, digital libraries • Providing context: annotation, metadata • Distribution: scalable architectures • Heterogeneity: scraping, wrapping, translation • Enterprise and organizational issues: standards, interoperation. The course will also examine particular systems for searching and intermediation on the Internet. The main assignments will be a series of projects, done individually and in groups. This course may be used in the Databases track of the CS MS. |