Exam questions

Course: Advanced Topics in Information Retrieval

Prof. Alexander Gelbukh

Spring 2004

1 General

1. What is IR?

2. What is the importance of IR?

3. What applications does IR have, now and in the future?

4. Is current IR a science, an art, or an engineering discipline? What is the difference?

2 Introduction

5. What is the difference between IR and data retrieval?

6. What are the main concepts of IR as a science?

7. What are the main concerns of IR? What are the problems it confronts and what does it aim to improve?

8. What is the user information need? What is the user query? What is the difference between them?

9. Does an IR system retrieve documents, order them, or both? Why? In which cases is each mode suitable?

10. What are the main steps in a user session with an IR system?

3 Modeling

11. What is modeling? What is the purpose of modeling in IR? How is it done?

12. How can you classify IR systems? What parameters characterize an IR system?

13. Give a mathematical definition of an IR system.

14. Enumerate the main IR models. Discuss the less common (alternative) models or variations (refinements) of the main models.

15. In the basic IR model, what are the term weights? How are the documents represented?

16. What is the Boolean model? How does it work? What are its advantages and disadvantages?

17. What is the Vector Space model? How does it work? What are its advantages and disadvantages?

18. What is the TF-IDF weighting scheme? What factors does it take into account? In what models is it used? In what models is it not used?

19. What is relevance feedback?

20. What is the Probabilistic model? How does it work? What are its advantages and disadvantages?

21. What is the idea of the Latent Semantic Indexing model?

22. What is the idea of a Neural Network model? Does it work well?

23. What are the main models for browsing?

24. Which of the main IR models is the simplest? Which one is currently considered the best? Why?

4 Retrieval Evaluation

25. Why is evaluation important?

26. What is a baseline?

27. Would you evaluate the correctness of the results in terms of the algorithm used or in terms of the user task? Why?

28. What are the main evaluation parameters specific to IR? Is it just one value? Why is it a problem? What are the possible solutions for this problem?

29. How can an IR system be evaluated in practice?

30. What are test reference collections? How are they created and used?

31. What are precision and recall? For what model are they used? For what model are they not used?

32. What is more important for a text IR system: precision or recall? In which cases is each one more important? Why?

33. How can ranked output be evaluated? What are the advantages and disadvantages of plots and diagrams? What are the advantages and disadvantages of single-value summaries?

34. What plots and diagrams are used to evaluate ranked output? What single-value summaries are used in IR?

35. What is the F-measure? What is the E-measure? What is R-precision? For what models are they used?

36. What reference collections do you know? What are their advantages and disadvantages?

5 Indexing and Searching

37. What is an index? How is it used?

38. What are the advantages and disadvantages of indexed and sequential search? Can indexed and sequential search be combined? How and what for?

39. What is an inverted file? What is its size? How is it used?

40. How can an inverted file be built?

41. What is block addressing? What is its overhead in terms of size and time? How is it used? What are its advantages and disadvantages? What collections is it good for?

42. What are signature files? How are they used? What are their advantages and disadvantages? What collections are they good for?

43. What is a suffix trie? A suffix tree? A suffix array? What are their advantages and disadvantages?

44. What methods give less space overhead? What methods are faster? What methods are both fast and give small space overhead? Why do people use methods other than those?

45. How are Boolean queries resolved? What is the complexity of such an algorithm? What techniques can be used to improve it?

46. How is search combined with compression? Is it true that compression gives a gain in disk space but slows down the search?

6 Multimedia IR

47. What are the applications of multimedia IR?

48. What aspects make multimedia IR methods different from text IR?

49. What is a usual user session with a multimedia IR system? How does it differ from a session with a text IR system?

50. How are multimedia objects modeled? How does this differ from text IR? What is metadata?

51. How can multimedia IR be combined with text IR? How does Google search for images?

52. What characterizes a multimedia IR query language? How does it differ from a text IR query language? Why?

53. What is a similarity function? What similarity functions do you know for multimedia data types?

54. What IR models are used with multimedia data? What are the main similarities and differences between multimedia and text IR?

7 Multimedia IR Indexing and Searching

55. Explain how multimedia IR is reduced to search in multidimensional space. Explain the role of clustering.

56. Discuss the role of feature selection for multimedia IR. Give examples of good and bad features. Is manual selection of features used in text IR?

57. What are the possible types of multimedia IR queries?

58. What is more important for a multimedia system: precision or recall? Why? What is correctness of a method?

59. How can the search speed be improved? What is the GEMINI method? What features can be selected for the GEMINI method? What is the lower-bound lemma? Does the GEMINI method improve the quality of the results, speed, or both? What is the assumption behind the GEMINI method to speed up the search?

60. What are time series? Which features are suitable for the GEMINI method applied to time series, and which are not? How are they used? What is a reasonable number of such features?

61. How is the similarity between images measured? What is the color similarity matrix? Why is it not used in text retrieval? What is a similar method in text retrieval?

62. What are the features of images suitable for the GEMINI method?

63. What automatic feature selection methods are there? What are the advantages and disadvantages of automatic feature selection as compared to manual feature selection?

8 Parallel and Distributed IR

64. What is the single-query response time? What is throughput?

65. What problem does parallel and distributed IR solve?

66. What are the measures for evaluation of parallel and distributed systems and algorithms?

67. What are document and term partitioning? How do they work? What are logical and physical partitioning? What are their advantages and disadvantages?

68. How are document and term partitioning used with inverted files, signature files, and suffix arrays?

69. What is the difference between parallel and distributed systems? What kind of partitioning is better for what kind of systems? How can clustering help in distributed IR?

70. What is a bottleneck for parallel and distributed systems?

71. What is a meta-search engine? What is the main problem for such a system?

9 Natural Language Processing for IR: Synonymy

72. What is the importance of text processing for IR? What are the main obstacles to applying text processing to IR?

73. What are the levels of “understanding” of a text?

74. What are the main problems for text understanding and text processing?

75. What is synonymy? Is it a big problem? What is the solution? Give examples of synonymy at different language levels. What is hyponymy/hypernymy? What are their similarities and differences with synonymy?

76. What is ambiguity? Is it a big problem? What solutions are there? Give examples of ambiguity at different language levels.

77. Why does the computer need knowledge to understand texts? What kind of knowledge does it need?

78. How can synonymy be handled in IR? What is query expansion? How can synonymy be handled at index time? What are the advantages and disadvantages? What is the role of an ontology?

79. What is morphology? How is it handled? What are the main problems in its handling?

80. What is stemming? What types of stemmers are there, and what are the general principles of their work? (Details of the Porter stemmer are not required.)

10 Natural Language Processing for IR: Ambiguity

81. What is the main problem of text understanding?

82. What is tagging? What problem does it solve? What is a tagger? How does it work? How can it be applied in IR?

83. What is a Hidden Markov Model? How is it related to tagging?

84. What is word sense disambiguation? What problem does it solve? How is it done? How can it be applied in IR?

85. What are word relatedness measures? What is the Lesk algorithm? What are Yarowsky’s principles, and how are they used for word sense disambiguation?

86. What is anaphora resolution? What problem does it solve? How is it done? How can it be applied in IR?

87. How are ambiguity resolution systems evaluated?

88. What are dictionary-based methods and statistical methods? What are their advantages and disadvantages?

11 Natural Language Processing for IR: Syntax

89. What are language levels? What language levels are there?

90. Language as encoder and decoder. What is the source of problems?

91. Linguistic module as a meaning-text translator.

92. What representations are used at different language levels?

93. What is syntactic representation? Is it language-dependent?

94. What is dependency structure? What is constituency structure? What are their advantages and disadvantages?

95. What is a syntactic tree?

96. What is a phrase structure grammar?

97. What is the context-independency hypothesis?

98. What is the generative idea? How is it related to the meaning-text translation idea?

99. What is parsing? How is it done?

100. What is syntactic ambiguity? How is it resolved?

101. What is shallow parsing?

102. How are syntactic ambiguity resolution systems evaluated?

103. What is the importance of syntactic analysis for IR? What problems does it solve? What ambiguity problems does it not solve?

12 Natural Language Processing for IR: Semantics

104. What is semantic representation? Is it language-dependent? How does it differ from syntactic representation?

105. What are lexical functions? What are their applications?

106. What is a semantic network? What is a logical representation of a semantic network? What are semantic valencies?

107. What is common-sense knowledge and how is it used in semantic networks?

108. What are conceptual graphs? How are they used in IR? How are they obtained from the text?

109. How can conceptual graphs be compared to define a similarity measure on texts? How is this measure used in IR? What are its advantages and disadvantages?

110. What other semantic-rich representations (other than a bag of keywords) can be used for IR? What are their advantages and disadvantages?

111. What is Question Answering?

112. What is passage extraction?

113. What is text summarization?

114. What is information extraction?

115. What is cross-lingual IR?

The End