Parsing Metadata using a Finite State Machine

Assignment 3 – September 17, 2006

Due: 8.00 a.m. September 24, 2006

Language: Use C++/C#/Java (or any statically typed object-oriented language).

Objectives:

Sub-objectives:

Please start early on this assignment. It is more involved than the previous ones.

You must call your repository "project3". So I will be able to access it from https://csil-projects.cs.uiuc.edu/svn/cs242//project3. IF you have called it something else, you can correct this using:

svn move https://csil-projects.cs.uiuc.edu/svn/cs242// https://csil-projects.cs.uiuc.edu/svn/cs242//project3 -m "Moved old_project_name to project3 as required"
It is also a very bad idea to add your executable or debug folder to the repository. So do not do that. You can tell Subversion to ignore certain files by doing:

svn propset svn:ignore "*" path_to_your_debug_folder
If you have already added the files to the repository, you have to do a svn delete on them first and then do the above svn propset command.

NOTE: Some of you will already be asking: Why shouldn’t I just use an XML library to parse the metadata? Short answer: Just don’t do it. Longer answer: Yeah, we all know that we should use an XML parser. However, the purpose of this assignment is to get you to write a finite state machine. And parsing the metadata offers a good opportunity to use one. So just do the assignment with a finite state machine.

This is the first of several assignments in which you will write programs to generate a portfolio that shows the projects that you have done in this class. You will revisit this portfolio project for many weeks to come. So make sure you write code that is clear and easy to read.

The input to this program will be a file containing information about your projects (a “metadata” file). This metadata file includes information about each of your projects. While you may add enhancements to the format, make sure that your program is still able to read in a metadata file specified in the original format.

The following is the format for the metadata your program should handle. The hierarchy of tags matters: they should be in the organization specified here. Not all of the tags are mandatory, so for example a project might not have a date specified. The format is pseudo markup language: every element is of the form <name>information</name>. Valid text is only at the “leaf” tags: title, date, author, summary, f, versioninfo

    

        <portfolio>

        <project>
           <title>Raster Processor</title>
           <date>September 1, 2006</date>
           <author>John Smith</author>
           <summary>Reads a raster and computes information about each elements
        neighbors. Given . . .
           </summary>
           <files>
               <doc>
                   <f>./readme.txt</f>
                   <f>./compatibility.txt</f>
               </doc>
               <code>
                   <f>./src/raster.cc</f>
                   <f>./src/util.cc</f>
               </code>
           </files>
           <versionhistory>
               <versioninfo>2.1 Added GUI</versioninfo>
               <versioninfo>1.0 Initial working algorithm</versioninfo>
           </versionhistory>
        </project>
        <project>
           <title>Raster Processor Part 2</title>
           <date>September 8, 2006</date>
           .
           .
           .
           ...
        </project>

        </portfolio>

    

Listing 1: Metadata format

The intention is that tag names should be self explanatory. If you think that correct behavior on some inputs is ambiguous, choose a behavior and justify your choice.

The <f> tag is a bit more tricky. The text between the <file> and </file> specifies the location to file from your directory. You need to decide whether you want this to be relative or absolute to your current directory. Your choice will determine how you create the links to those files in your HTML file.

If you make any assumptions, constraints, generalizations, etc. be sure to specify them.

Figure 1: A subset of the finite state machine to parse the tags

The figure above is meant to give you an idea of how to proceed1. Each tag can be thought of as a state. And you can represent each state with an object or class in your programming language. Please make use of inheritance where possible. Also, feel free to come up with your own implementation as long as it uses a finite state machine.

Here is what you need to do:

  1. Read in a metadata file of the format defined. You need to populate it with your own data for the projects we have worked on so far.
  2. Parse this file using a finite state machine. Maintain an internal representation of the file. A tree is the most convenient data structure. But you can choose your own.
  3. Generate HTML files that present the information in the metadata file. There should be a separate HTML page for each. The entry point for navigation through the portfolio should be named index.html and it must link to the individual project pages. Note that the metadata includes the names of other files, such as source code files. These files should be linked from your portfolio page. Obviously, for the portfolio to be viewed, these files must also be in the directory containing the generated HTML.

These HTML files should offer a nice interface for browsing your projects. That is, you should be able to open the index.html file and browse to a page for any one of your projects, where you can see more detailed information and access the source code. The format and presentation of these files are up to you. You may make use of CSS or Javascript. For example, you could transform your metadata into the S5 format to convert them into a slide show.

Figure 2: An example of how the generated HTML files are linked

    

           <h2>Raster Processor</h2>

           <b>Date: September 1, 2006</b> <br />
           <b>Author: John Smith</b> <br />

           <b>Summary: Reads a raster and computes information about each elements
        neighbors. Given . . .</b> <br />

           <ul>
               <li>Documents: 
                  <ol>
                <li><a href="./readme.txt">./readme.txt</a></li>
                <li><a href="./compatibility.txt">./compatibility.txt</a></li>
              </ol>
               </li>
               <li>Code:
                  <ol> 
                <li><a href="./src/raster.cc">./src/raster.cc</a></li>
                <li><a href="./src/raster.cc">./src/util.cc</a></li>
              </ol>
               </li>
                </ul>

            Version history:
           <ul>
               <li>2.1 Added GUI</li>
               <li>1.0 Initial working algorithm</li>
           </ul>

    

Listing 2: A possible HTML output for the the first project from Listing 1

Raster Processor

Date: September 1, 2006
Author: John Smith
Summary: Reads a raster and computes information about each elements neighbors. Given . . .
Version history:

Listing 3: What the code from Listing 2 would look like rendered in HTML

If you need help on HTML , Google is your friend. Or you can always go to a website and select View Source to see the underlying HTML code. You can also post questions on the newsgroup.

A note on #2. We want to see an internal representation of the metadata. It is not enough to just output the relevant html from each tag as you encounter it. For those of you who have worked with XML parsers, this means that we want a DOM model and not a SAX model. We can verify this by looking at your code or by running the program in the debugger.

Grading:

Read before starting:


1 If you are willing to not use a pure finite state machine, you can make things simpler. The following is a more general approach but it requires the use of a stack to keep track of the tags that you have encountered. Again, it assumes that only “leaf” tags have text in between.

Of course, newlines have to be handled properly as well.

This is also just a suggestion. You can come up with your own hybrid implementation.

Figure A: A hybrid finite state machine that uses a stack to remember the name of the “open” and “closed” tags.