15/Abr (21:00) - Data e hora limite de entrega com atraso. Um valor de penalização por cada dia de atraso.
Changelog
- 06/Abr - Foram adicionadas, logo no início do enunciado, as regras de submissão, que envolvem dois concursos do Mooshak: "concurso de teste" e "concurso para submissão final".
- 03/Abr - Havia uma ocorrência em excesso de "list" no tipo do argumento "f" das funções "transform" e "transform_d".
- 02/Abr - Nova secção "Eclipse, using the interpreter", com mais uma via a tentar, no caso de você continuar a ter problemas em usar a biblioteca Xml-Light no Eclipse.
- 01/Abr - Nova secção "Plan B for Eclipse", que interessa a quem não estiver a conseguir usar o plugin de OCaml dentro do Eclipse.
- 29/Mar - A secção "Installing the Xml-Light library" foi renomeada "Using the Xml-Light library" e agora explica como se usa a biblioteca do XML dentro do Eclipse e na consola.
- 27/Mar - Possíveis correções a este enunciado serão assinaladas aqui.
Regras de submissão
Playing with XML
Introduction
XML (Extensible Markup Language) is a widely used textual data format for storing electronic documents and for representing data structures transfered between Web applications. It can express both rigidly structured data, like tables, and loosely structured data, like office documents.
XML was initially created to overcome the limitations of HTML. While HTML is a textual data format oriented for the graphical presentation of data using a set of predefined tags (e.g. <p>, <h1>, <table>), XML is a format for representing data in general (regardless of their graphical appearance) and allows the users to invent their own tags.
This LAP project explores the XML format in a basic way, using it as an excuse for practicing the manipulating of n-ary trees. There will more meaningful uses of XML in later courses of the MIEI.
The XML format
An XML document is a text document containing certain syntactic constructions that add structure and meaning to the text. These syntactic constructions use special codes called tags that the user invents to convey intended meanings.
To illustrate, here is an XML document with some data about a lady called Alice and her children:
<person gender='F'>
Alice
<children>
<person gender='M'> Andre <children></children> 5 years old </person>
<person gender='F'> Maria <children></children> 4 years old </person>
</children>
35 years old
</person>
Only two tags are used in this example: 'person' and 'children'. When the tags are well chosen, the XML document becomes self-explanatory. You should be able to answer the following questions about Alice and her children: How old is Alice? How many children does she have? What is the name and age of her children? Does she have grandchildren?
This Alice document describes a n-ary tree. As a matter of fact, every XML document describes a n-ary tree. The part of the text that describes the entire tree or any subtree is called an XML element.
Syntactically, each XML element begins with the opening of a tag, e.g. <person>, and ends with the closing of this same tag, </person>. The opening of a tag can optionally include some attributes with the corresponding values, as in this example <person gender='M'>. Finally, between the opening and closing of a mark, lies the contents of the element, consisting of plain text optionally interspersed with XML sub-elements. The plain text component is traditionally called pcdata. In the Alice example, the contents of the main element consists in: the segment of pcdata "Alice"; then a sub-element describing the children of Alice; finally the segment of pcdata "35 years old".
An element without contents, such as <children></children> can be abbreviated like this <children/>.
Now, let us observe a large XML document - a well known comedy play by William Shakespeare, encoded in XML:
The Xml-Light library
XML-Light is the name of an OCaml library, developed by Nicolas Cannasse, that supports XML processing in OCaml. It is a free software package, subject to the LGPL license. It is not part of the official distribution of OCaml.
This library consists of a small number of modules, but we only need to deal with a particular module called Xml. This module is mainly concerned with the syntax of XML and therefore provide some functions for reading and writing XML documents. The reading functions convert the XML syntax to an OCaml representation, and the writing functions do just the opposite, converting the OCaml representation to XML syntax.
Representation of XML in OCaml
The OCaml representation of XML in the module Xml is defined by the following recursive type.
type xml =
Element of string * (string * string) list * xml list
| PCData of string
;;
Any value of this type, we will call an xml tree, or more simply a tree.
The type xml is a sum type with the following two variants:
- Element - represents an XML element, characterized by a tag, a list of attributes with the corresponding values, and a contents list; any value of this variant we will call an element.
- PCData - represents a maximal segment of plain text, occurring inside the contents of a XML element; any value of this variant we will call a pcdata.
To better understand the type xml, here is how the Alice example gets represented:
let alice =
Element ("person", [("gender", "F")], [
PCData " Alice ";
Element ("children", [], [
Element ("person", [("gender","M")], [
PCData " Andre "; Element ("children", [], []); PCData " 5 years old "
]);
Element ("person", [("gender","F")], [
PCData " Maria "; Element ("children", [], []); PCData " 4 years old "
])
]);
PCData " 35 years old "
]) ;;
The functions of the module Xml
The functions provided by the Xml module are documented here: Xml.html.
The relevant functions for our project are these:
tag : xml -> string
pcdata : xml -> string
attribs : xml -> (string * string) list
attrib : xml -> string -> string
children : xml -> xml list
parse_file : string -> xml
parse_in : in_channel -> xml
parse_string : string -> xml
to_string : xml -> string
to_string_fmt : xml -> string
Examples
Here is a small program written in OCaml that exemplifies the use of the parser and the printer supplied by the module Xml:
#load "xml-light.cma" ;; (* Loads the library Xml-Light -- remove this line in Eclipse *)
open Xml ;; (* Allows direct access to the symbols of the module Xml *)
let t = parse_string "
<person gender='F'>
Alice
<children>
<person gender='M'> Andre <children/> 5 anos </person>
<person gender='F'> Maria <children/> 4 anos </person>
</children>
35 anos
</person>"
in
print_string ("XML formated = " ^ to_string_fmt t ^ "\n")
;;
Now, an example of a function that processes an xml tree. This function checks if there is any pcdata in a given xml tree.
let rec existsPCData t =
match t with
Element (g,a,cs) -> lexistsPCData cs
| PCData _ -> true
and lexistsPCData tl =
match tl with
[] -> false
| t::ts -> existsPCData t || lexistsPCData ts
;;
Using the Xml-Light library
Get the source code of Xml-Light here: xml-light-2.2.zip.
To compile the library, use the following compound command:
unzip xml-light-2.2.zip ; cd xml-light ; make ; make
Eclipse, using the compiler
Do this inside Eclipse:
- Create the MoreXml project in the usual way (don't forget, it is an "OCaml Managed Project"!)
- Project > Properties > Project Paths > Add -----> the full path to the directory xml-light
- Project > Properties > Ocaml Build Flags > Add -----> xml-light.cma
- In the file "MoreXml.ml", the following line must appear before the definition of the functions:
open Xml ;;
Eclipse, using the interpreter
Do this inside Eclipse:
- Create the MoreXml project in the usual way (don't forget, it is an "OCaml Managed Project"!)
- Simply place the following three lines before the definition of the functions:
#cd "the full path to the directory xml-light"
#load "xml-light.cma"
open Xml;;
Plan B for Eclipse
If are unable to make any of the previous technique work, here is a simpler but inferior alternative:
- Forget about the Xml-Light library altogether.
- Place this definition directly in your program:
type xml =
Element of string * (string * string) list * xml list
| PCData of string
;;
- You can obtain here two XML documents alterady converted to the OCaml representation.
Console
To develop the project using a simple text editor and the ocaml interpreter running in a console:
- Copy the files xml.cmi and xml-light.cma to the directory of your project
cp xml.cmi xml-light.cma MyProjDir
- In the file "MoreXml.ml", the following two lines must appear before the definition of the functions:
#load "xml-light.cma" ;;
open Xml ;;
The goal of this project
The module Xml offers a XML parser and a XML printer. However, to help in the task of writing XML processing programs, we feel the need for some more general purpose XML oriented functions.
The goal of this project is to develop a new open module, called "MoreXml", that would provide a collection of generic XML processing functions implemented on the top of what the module Xml already offers.
The name of the source file must be "MoreXml.ml".
The open module MoreXml
Here are the specifications of the functions to implement:
tag : xml -> string
(* t res *)
The tag of the xml tree t. If t is an element, returns the natural tag of t. If t is a pcdata, returns the empty string "". Note that the tag of an element can never be the empty string, so we can use the empty string to represent and "extendend tag" for pcdata.
attributes : xml -> (string * string) list
(* t res *)
The attributes of the xml tree t. If t is an element, returns the natural attributes of t. If t is a pcdata, returns the empty list [].
contents : xml -> xml list
(* t res *)
The contents of the xml tree t. If t is an element, returns the natural contents of t (that is the children of t). If t is a pcdata, returns the empty list [].
leaf : xml -> bool
(* t res *)
Checks whether the xml tree t is a leaf. An xml tree is a leaf if and only if it is a pcdata or if it is an element without contents.
size : xml -> int
(* t res *)
The size of a xml tree t is the number of nodes it contains. A tree that is a pcdata has size 1. A tree that is an element without contents also has size 1. A tree that is an element with some contents will always have a size greater than 1.
height : xml -> int
(* t res *)
The height of the xml tree t is the length of the longest descending path that is possible to find in a xml tree. A tree that is a pcdata has height 1. A tree that is an element without contents also has height 1. A tree that is an element with some contents will alway have height greater than 1.
width : xml -> int
(* t res *)
The width of the xml tree t is the maximum of widths of all levels. A tree that is a pcdata has width 1. A tree that is an element without contents also has width 1. A tree that is an element with some contents have width 1 or greater.
changeTag : string -> string -> xml -> xml
(* g1 g2 t res *)
Creates a copy of the xml tree t, replacing all the occurrences of the tag g1 with the tag g2. Nothing more changes in the tree.
select : (xml -> bool) -> xml list -> xml list
(* f tl res *)
Selects from the xml tree list tl the trees for which the predicate f returns the boolean value true. The result is the list of selected trees in the same order they occur in the input list.
select_d : (xml -> bool) -> xml list -> xml list
(* f tl res *)
Performs a depth-first transversal of the xml tree list tl, selecting all the subtrees for which the predicate f returns the boolean value true. The result is the list of selected subtrees in the same order they are discovered during the transversal. Note that a tree is considered subtree of itself.
count : (xml -> bool) -> xml list -> int
(* f tl res *)
Given the xml tree list tl, counts the number of xml trees for which the predicate f returns the boolean value true.
count_d : (xml -> bool) -> xml list -> int
(* f tl res *)
Performs a depth-first transversal of the xml tree list tl, counting all the subtrees for which the predicate f returns the boolean value true. Note that a tree is considered subtree of itself.
project : string -> xml list -> xml list
(* tag tl res *)
Selects from the xml tree list tl the trees with the tag tag. If tag is the empty string "", then the pcdata values are selected. The result is the list of selected trees in the same order they occur in the input list.
project_d : string -> xml list -> xml list
(* tag tl res *)
Performs a depth-first transversal of the xml tree list tl, selecting all the subtrees with the tag tag. If tag is the empty string "", then the pcdata subtrees are selected. The result is the list of selected subtrees in the same order they are discovered during the transversal. Note that a tree is considered subtree of itself.
transform : (xml -> xml) -> xml list -> xml list
(* f tl res *)
Applies the function f to all the trees in the xml tree list tl, returning the list of results by the same order. The function f may be partial, i.e. can generate exceptions when applied to certain elements. Any tree that generates an exception is transfered unchanged to the result. In other words, an exception is interpreted as meaning "no transformation for this tree".
The function f is allowed to call the function transform. This is useful if the transformation of the input tree is a depth transformation (i.e. not shallow).
transform_d : (xml -> xml) -> xml list -> xml list
(* f tl res *)
Applies the function f to all the trees in the xml tree list tl, returning the list of results by the same order. The function f may be partial, i.e. can generate exceptions when applied to certain elements. Any pcdata that generates an exception is transfered unchanged to the result. As for any element that generates an exception, its tag and attributes are left unchanged in the result, but its contents is recursively processed using transform_d. In other words, an exception is interpreted as meaning "no transformation for the root of this tree, but propagate the transformation to the subtrees".
This function is similar to transform except for the automatic propagation of the transformation to the contents of the elements that generate exceptions.
transform_d supports systematic depth transformations of a list of xml trees. The function f can consider several patterns of interest and specify transformations for the input trees matching those patterns. The non-matching input trees should rise exceptions, for example "Match_failure" exceptions.
The function f is allowed to call the function transform_d. This is useful is the transformation of an input tree matching a pattern of interest requires a depth transformation (i.e. not shallow).
shakespeare : xml -> int * int * int * int * int
(* t speeches lines speakers minspeech maxspeech *)
Gathers some statistics about the speeches contained in a William Shakespeare play. As you can see in this play, a SPEECH element does not have attributes and its contents is a single occurrence of a SPEAKER element followed by a sequence of LINE and STAGEDIR elements. We are interested in these five statistics:
- Total number of SPEECHes.
- Total number of LINEs.
- Total number of distinct SPEAKERs.
- The length of the shortest SPEECH, measured in the number of LINEs.
- The length of the longest SPEECH, measured in the number of LINEs.
Regras principais
- Produza um ficheiro chamado MoreXml.ml. Nas regras de submissão será explicada a forma de submeter no Mooshak.
- O ficheiro "MoreXml.ml" tem de incluir logo nas primeiras linhas, um comentário inicial contendo: o nome e número dos dois alunos que realizaram o projeto; indicação de quais as partes do trabalho que foram feitas e das que não foram feitas (para facilitar uma correção sem enganos); ainda possivelmente alertando para alguns aspetos da implementação que possam ser menos óbvios para o avaliador.
- O projeto é para ser realizado por grupos de dois alunos. Um projeto entregue por três ou mais alunos vale zero valores. Poderão ser permitidos grupos de um aluno em circunstâncias especiais que têm de ser bem explicadas no comentário inicial atrás referido.
- Na realização deste projeto é proibido usar os mecanismos imperativos que a linguagem OCaml suporta, mas não foram estudados nas aulas.
- Mesmo que desenvolva o programa em Windows ou no MacOS, a versão final do seu programa deverá correr no sistema Linux instalado nos laboratórios.
- Programe as funções recursivas usando o método indutivo. Também pode usar livremente funções de biblioteca, especialmente as disponíveis no módulo List.
- O programa deve ser bem indentado, por forma a ficar bem legível. Além disso, a largura do programa não deve exceder as 80 colunas para poderem ser impressos. Podem haver algumas exceções, muito pontuais.
- O não cumprimento das regras anteriores implica penalizações automáticas na nota.
Regras de entrega
- Será ativado um concurso do Mooshak, que servirá para submeter os trabalhos. Os detalhes da forma de fazer a submissão serão divulgados nessa altura. Até lá preocupe-se apenas em escrever um bom programa.
- Depois do prazo limite ainda se aceitam trabalhos atrasados, mas com penalizações na nota. Mais detalhes nas primeiras linhas deste enunciado.
Outras regras
- Apesar de o projeto ser de grupo, cada aluno, a título individual, tem a responsabilidade de responder por todo o projeto. Assim é indispensável que os dois membros de cada grupo programem efetivamente.
- Não se proíbe que alunos de turnos práticos diferentes façam grupo. Isso é apenas desaconselhado.
- Não há inscrição prévia dos grupos e basta que cada trabalho tenha 2 autores identificados.
- A nota máxima do projeto é 20 valores.
Avaliação
O docente responsável pela gestão e pela avaliação deste trabalho é o Professor Artur Miguel Dias.
A nota do projeto será em grande parte determinada por meios automáticos, através do Mooshak. Portanto é essencial respeitar a especificação contida neste enunciado, em todos os seus detalhes.
Mas, relativamente a programas que funcionem minimamente, também haverá uma apreciação mais subjetiva da qualidade, tendo em conta aspetos, tais como:
- organização,
- clareza e simplicidade das ideias programadas,
- bom uso da linguagem,
- legibilidade do código,
- em alguma medida, eficiência.
Obviamente não é obrigatório fazer o trabalho todo para obter nota positiva. Mas, claro, vale a pena trabalhar para produzir uma solução bastante completa e com a melhor qualidade possível.
Observações
- Os grupos são incentivados a discutir entre si aspetos do projeto, inclusivamente no fórum. Mas sempre que chega o momento de escrever código concreto, esse tem de ser um esforço interno a cada grupo (trabalhando de forma independente de todos os outros grupos). A escrita de código exige esforço intelectual, mas só com esforço se consegue evoluir.
- O objetivo deste projeto é levar os alunos a praticar. Um aluno que pratique de forma genuína ganha experiência e provavelmente não terá dificuldade em conseguir aprovação nos testes e exames.
- Cuidado com as fraudes. Por exemplo, se alguém dum grupo oferecer o projeto resolvido a um elemento de outro grupo, trata-se duma fraude envolvendo dois grupos. Também se um grupo deixa distraidamente a área aberta e se alguém de outro grupo "rouba" o projeto, então também se considera fraude dos dois grupos. Ainda um terceiro caso: se dois grupos se juntam para fazer o projeto conjuntamente e depois o entregam em duplicado, então também se considera fraude. Em suma, cada grupo é responsável pelo seu projeto e não o pode mostrar ou oferecer, direta ou indiretamente, de propósito ou sem querer, o seu código a outro grupo. Note que é muito melhor ter zero num dos três projetos do que ser logo excluído da cadeira por motivo de fraude.
Final
Bom trabalho! Esperamos que goste.