CS 241: System Programming

Machine Problem 3
File System Indexing

Deadline: Monday, 3 April, 4:00pm
Extended Deadline: Tuesday, 4 April, 4:00pm
(electronic submission)

Solution: Makefile, index.c, index.h, indexer.c, searcher.c (helper.h, helper.c)

Description This MP will direct you through indexing a directory of files. You will first be designing a data structure and corresponding serialization capabilities, then continuing on to use this data structure to index an individual file. You can then extend your program to index a whole directory of files, recursively.

Please be sure to read the sections mentioned below in R&R before you begin work on this assignment. A clarification section is also below the grading criteria at the bottom, and will be updated as necessary.

Data structure serialization is tricky to understand and implement correctly. You may choose to ignore that part of the assignment, and implement a slightly altered version of the assignment; but you will not be eligible for any points in the data structure design section, and some points in the remaining sections.

If you complete all parts of the MP, you should end up with two executables, indexer and searcher, such that the former takes a directory as an argument, indexes every file, and saves the index to a file specified by another argument. The latter program will load the index into memory, and search it for a keyword (or keywords) which was provided at the command-line. Both programs will also print timing information. Example output is provided here.

Part 1: Data Structure Design and Serialization

In order to proceed through the rest of the MP, we must first design a suitable data structure. Our goal is to have a data structure (and methods) that allow us to store a tuple of information. We will want to separate words in a file to create an index for that file. Thus, we want our tuple to store a word from the file, the file it appears in, and where it appears in the file (byte offset). We need one of these tuples for each occurrence of any word in any file at any offset. One assumption you can make is that the keywords are only lowercase alphabetic characters. In Part 2, you will be using a helper function we provide to convert all keywords to lowercase alphabetic before insertion.

Start by designing on paper how your data structure will look. Keep in mind all the various directions it will need to grow. For example, if we are storing our files and keywords separately, we should probably use linked lists for both, as both will need to be extendable. If we used an array, we would have expensive operations to grow and shrink it. What kind of optimizations might we want to add? Skip-lists? Hash-tables? Tries? There is no correct data structure to use. But keep in mind you will be storing thousands of keywords, associated with scores of files.

We also need to be able to serialize the data structure we create. A data structure in memory is no good to us, as our program is not persistent. Serialization will allow us to take all the pointers, and convert them into array indices. If we have a linked list, we traverse it once to get the size, malloc some memory to store it as an array, and then copy over the data we need. If we have a trie, we need to do some order-preserving traversal of it, and store all the information we see.

We will be teaching simple serialization methods in discussion the week of 13 March. There is also an example of serializing a data structure of float pointers included below.

Once you have designed how your data structure stores information, you will probably want to add helper methods that allow the rest of your program to use the structure. You will probably need some sort of Insert function that adds the tuple of (keyword, file, offset) into the structure. Having also a Search may come in handy.

You can design the structure however you wish, and provide as few or as many methods that operate on the data structure as you need for the rest of your program. Store the structure in a index.h and index.c file. Include the serialization methods there as well.

Also, create a formal report of your data structure including all of the following: PNG file (index.png) with a drawing of the elements of the data structure, and the pointers (as arrows) between various elements; a summary of the performance of the structure as used for this MP (eg big-O notation for insert and search operations); and a few sentences on why you choose to use this structure.

Part 2: Indexing a File

Create a program, indexer (indexer.c), which takes as an argument a filename, and produces an index of that file. This program will want to use the index data structure you created above, and call Insert repeatedly on the structure to add tuples of information to the index. The index would then contain all the information about a file; it would contain each and every word that appears in the file, and where it appears.

Following Chapter 4 in R&R, open the file specified at the command-line. Read all of the contents into a buffer using restarted calls to read (or use r_read from the restart library). Use a restarted close (or r_close) to close the file after you have filled the buffer. Now run processString from the given code on the buffer to process it. Using strtok, break the string into pieces, and insert each of them into the index. Finally, serialize that index into a similar buffer, open the index file, restart write (or r_write) the buffer into the file, and close it again. The index should now be stored on disk.

We use processString to get rid of all non-lowercase alphabetic characters in the buffer. We don't want to bother indexing numbers, or symbols; and we want "don't" and "dont" to index the same way. This helper function provided in helper.c gets rid of all of this stuff. Please use this function for uniformity. For example, "#include <stdio.h>" would become "include stdioh " before indexing; strtok will return "include" and "stdioh", which you should then insert into the index.

In order to properly size the buffer that you will store the contents of the file into, you may find use in stat/fstat as mentioned in Section 5.2.1 of R&R. The st_size member of struct stat will give the size of a file.

Part 3: Indexing a Directory

Modify your program above to work on directories. It should now take as a command-line argument a directory to search through. The program should operate similarly to Program 5.3 in R&R, except instead of printing the filenames, it will need to run whatever function you created above to index a single file. After you modify your program to operate on a directory, it is no longer necessary to operate on a single file.

Note that the directory listing provided by readdir will include . and .., but these should be ignored. Do not follow these directories, as your program will then try and index the entire file system, and never stop (/usr/../usr/../usr/../usr/../usr/.. ...).

To be valid, your solution must also recursively enter into directories found within the specified directory. This is nontrivial, as readdir stores information internally. The man pages for telldir and seekdir may come to be of assistance. Also, keeping a queue of directories to traverse (breadth-first indexing vs depth-first) will bypass this problem.

To determine if an entry in the directory is another directory, you can try changing into it (will succeed if it's a directory, fail otherwise) with chdir, or using the same struct stat from above and examining st_mode with the S_ISDIR macro as described in Section 5.2.2 of R&R.

Part 4: Searching an index

Write a second program, searcher (searcher.c), which will load a saved index from disk into a buffer, unserialize it, and then search it for an occurrence of a specified keyword. You should also make use of processString here, in order to properly search for a keyword in the same way it was indexed.

Print out all results from the index file in any fashion you want, along with their offsets. More than one keyword may be present at the command-line, one after another (ie searcher index word1 word2 word3). If multiple keywords are specified at the command-line, search for each one individually. If one argument specified at the command-line is actually multiple keywords, use strtok to split it. See sample output for more examples.

Parts 2-4: Alternate Assignment

If you are having trouble with the data structure design part, you can craft a single program, finder (finder.c), to do the reading of the files, as well as the searching. Open, read, and close files in the same way as part 2 above, and put the contents into a buffer. Run processString on the buffer in the same fashion. But now just compare the command-line specified keywords to every string returned by strtok on the data read from a file. If there are any matches, print out results in the same manner that part 4 would have.

Also extend your implementation to search through an entire directory of files (recurisvely still), and compare the command-line keywords to each token in each file encountered.

Doing this part is not necessary to getting a perfect grade on this assignment, and should only be completed by students wishing to skip part 1. Also remember that doing only this section (and part 5 below) will result in a lower highest possible grade for the MP.

Part 5: Timing of Functions

Similarly to MP2, we want to measure various execution times of your implementation. A slow implementation is not worth any fewer points, but we still want the times printed.

Follow the sample output above, and mark each file index with the amount of time it took to index that file. Also sum up all the index times and print that value as well as the total execution time of the program. Do similarly for the searching program (or finder if you did the alternate assignment).

Same as MP2, use clock_gettime as used in Example 9.8 in R&R to time the various requested behaviors.
Given Code helper.h, helper.c: Includes the processString method useful for preprocessing a buffer read from a file, before tokenizing and indexing.
floatlist.h, floatlist.c, main.c: A sample data structure and serialization, with main(), and a few comments.
Sample output, same link from above
Deliverable This assignment is expected to be completed via pair programming. You are encouraged to form a group of 2, although you may elect to work alone. Newsgroup communication about general issues is allowed. Please post any questions on the given code on the newsgroup. You should use the CSIL Linux machines to do the assignment.

You are strictly required to use C. For Part 1, create a index.h and index.c file to store the data structure; and create a index.png file to represent the structure you have created. For Parts 2-3, create indexer.c to index a directory of files. For Part 4, create searcher.c to search through the index created by your indexer. For Part 5, edit those two files to include timing information. Finally, create a makefile which builds two executables, indexer and searcher, from the source code and the given files helper.h and helper.c.

If you choose to do the alternate assignment, you should create only a finder.c and a makefile which builds it into finder, still using the given code where necessary.

Create a README file as well, detailing any problems encountered during program design or execution. Include in the README the following information:
  1. MP number and title
  2. Your names and netids
  3. How you split the work for the entire MP
  4. Analysis of Data structure
Handin Use Compass to electronically handin an archive containing the files listed above (makefile, index.c, index.h, indexer.c, searcher.c, index.png, README OR makefile, finder.c, README) under the "MP3 Handin" assignment. This must be done before 4pm on Tuesday, 4 April. No submissions accepted after 4pm on Tuesday, 4 April.

Grading Criteria Ideal Submission:
  • Part 1: Data structure - 20%
    • Design - 4%
    • Serialization - 10%
    • Analysis - 6%
      • PNG graphic of structure - 2%
      • Evaluation in big-O of insert and search - 2%
      • Subjective review of the structure (why did you pick it?) - 2%
  • Parts 2 & 3: Indexing a directory of files - 50%
    • Indexing of a file within a directory - 25%
    • Indexing a directory recursively - 25%
  • Part 4: Searching an index - 15%
  • Part 5: Timing information - 10%
  • Makefile and README - 5%

Part 2 alone is worth 25% if Part 3 is not completed

The alternate assignment has the same 10% for part 5, and 5% for Makefile and README. The finder code is worth an additional 55% (30% if there is no recursive searching of directories). This results in a maximum MP score of 70%.
Clarifications
  1. To calculate an offset in a file (needed for insertion into the data structure, and is printed in Part 4), you can use pointer arithmetic. If buf points to the beginning of your character buffer, and strtok returns a pointer ptr to the next token, then ptr - buf will be an integer that represents the offset in the file of that token.
  2. You are welcome to use a publicly available data structure for Part 1, provided you write the serialization, as well as provide the analysis of it.
  3. The keywords that result from Part 2 after running processString will be composed of only lowercase alphabetic characters. You can plan around this while designing your data structure in Part1.
  4. You are not allowed to perform the task of serialization before all files have been processed and your data structure has been fully created. Doing so trivializes one part of the assignment.
  5. After running processString on the input arguments for searcher, call strtok on that buffer to return the word/words that are contained there. processString may add extra spaces that strtok would remove, and not removing them would result in a failed string compare.
  6. Your unserialization code should not make any calls to your Insert function you wrote for your data structure. We want to reconstruct an exact copy of what was in memory before, and Insert only guarantees we get a copy that stores the same information, not necessarily in the same manner.