XML and Semi-Structured Data


XML is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It stands for eXtensible Markup Language. It was designed to store and transport data.


<?xml version="1.0" encoding="UTF-8"?>

Markup and content :-

The characters making up an XML document are divided into markup and content, which may be distinguished by the application of simple syntactic rules.

Generally, strings that constitute markup either begin with the character < and end with a >, or they begin with the character & and end with a ;. Strings of characters that are not markup are content. In addition, whitespace before and after the outermost element is classified as markup.

Tag :-

A tag is a markup construct that begins with < and ends with >. Tag names are case-sensitive; the start-tag and end-tag must match exactly.

Tag names cannot contain any of the characters !"#$%&'()*+,/;<=>?@[\]^`{|}~, nor a space character, and cannot begin with "-", ".", or a numeric digit.

A single root element contains all the other elements.

Tags come in three flavors:

start-tag, such as <section>;

end-tag, such as </section>;

empty-element tag, such as <line-break />.
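
The tag rules above can be checked with any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical and introduced only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Start-tag and end-tag match exactly.
print(is_well_formed("<section>text</section>"))
# Tag names are case-sensitive, so these two tags do not match.
print(is_well_formed("<section>text</Section>"))
# An empty-element tag needs no separate end-tag.
print(is_well_formed("<line-break />"))
# A tag name may not begin with a numeric digit.
print(is_well_formed("<1st>text</1st>"))
```

The second and fourth calls are rejected by the parser; the other two are accepted.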

Element :-

An element is a logical document component that either begins with a start-tag and ends with a matching end-tag or consists only of an empty-element tag. The characters between the start-tag and end-tag, if any, are the element's content, and may contain markup, including other elements, which are called child elements.

An example is <greeting>Hello, world!</greeting>.

Another is <line-break />.
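
To see how child elements relate to their parent, the two examples above can be placed inside one enclosing element and walked with a parser. This is a small sketch using Python's standard xml.etree.ElementTree; the enclosing element name "message" is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical document: <message> is the parent of two child elements.
doc = "<message><greeting>Hello, world!</greeting><line-break /></message>"
root = ET.fromstring(doc)

for child in root:
    # child.text is None for an empty element such as <line-break />.
    print(child.tag, repr(child.text))
```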

Attribute :-

An attribute is a markup construct consisting of a name-value pair that exists within a start-tag or empty-element tag.

An example is <img src="madonna.jpg" alt="Madonna" />, where the names of the attributes are "src" and "alt" and their values are "madonna.jpg" and "Madonna" respectively.

 Another example is <step number="3">Connect A to B.</step>, where the name of the attribute is "number" and its value is "3".

An XML attribute can only have a single value and each attribute can appear at most once on each element. In the common situation where a list of multiple values is desired, this must be done by encoding the list into a well-formed XML attribute with some format beyond what XML defines itself. Usually this is either a comma-delimited or semicolon-delimited list or, if the individual values are known not to contain spaces, a space-delimited list. An example is <div class="inner greeting-box">Welcome!</div>, where the attribute "class" has the single value "inner greeting-box", which the application interprets as the two CSS class names "inner" and "greeting-box".
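
The attribute rules can be tried out as follows. This is only a sketch, assuming Python's standard xml.etree.ElementTree; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

img = ET.fromstring('<img src="madonna.jpg" alt="Madonna" />')
print(img.attrib)  # mapping of attribute names to values

div = ET.fromstring('<div class="inner greeting-box">Welcome!</div>')
# XML sees a single string; splitting it into a list is the application's job.
class_names = div.get("class").split()
print(class_names)

# Repeating an attribute on one element is not well-formed.
try:
    ET.fromstring('<step number="3" number="4" />')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
print(duplicate_rejected)
```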

XML declaration :-

XML documents may begin with an XML declaration that describes some information about them.

 An example is <?xml version="1.0" encoding="UTF-8"?>.
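
A serializer can write the declaration for you. The sketch below assumes Python 3.8 or later, where the tostring function of xml.etree.ElementTree accepts an xml_declaration argument; the element used is just the greeting example from earlier.

```python
import xml.etree.ElementTree as ET

root = ET.Element("greeting")
root.text = "Hello, world!"

# xml_declaration=True asks the serializer to emit the declaration first.
data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
print(data.decode("utf-8"))
```

The declaration appears on the first line of the output, followed by the element itself.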

Schemas and validation :-

In addition to being well-formed, an XML document may be valid. This means that it contains a reference to a Document Type Definition (DTD) and that its elements and attributes are declared in that DTD and follow the grammatical rules for them that the DTD specifies.

A DTD is an example of a schema or grammar, and it is the oldest schema language for XML.
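
The difference between well-formed and valid can be illustrated with a small sketch. It assumes Python's standard xml.etree.ElementTree, which uses a non-validating parser: it reads the DTD but does not enforce it. The element names note, to and body are hypothetical.

```python
import xml.etree.ElementTree as ET

# The internal DTD says a <note> must contain <to> followed by <body>,
# but the document below leaves <body> out.
DOC = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note><to>Reader</to></note>"""

# Parsing succeeds because the document is well-formed,
# even though it is not valid against its own DTD.
root = ET.fromstring(DOC)
children = [child.tag for child in root]
print(root.tag, children)
```

Catching the missing <body> element would require a validating parser, which is outside what this sketch shows.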

DTDs have the following benefits:-

Ø  DTD support is ubiquitous due to its inclusion in the XML 1.0 standard.

Ø  DTDs are terse compared to element-based schema languages and consequently present more information in a single screen.

Ø  DTDs allow the declaration of standard public entity sets for publishing characters.

Ø  DTDs define a document type rather than the types used by a namespace, thus grouping all constraints for a document in a single collection.

DTDs have the following limitations:-

Ø  They have no explicit support for newer features of XML, most importantly namespaces.

Ø  They lack expressiveness. XML DTDs are simpler than SGML DTDs and there are certain structures that cannot be expressed with regular grammars. DTDs only support rudimentary datatypes.

Ø  They lack readability. DTD designers typically make heavy use of parameter entities (which behave essentially as textual macros), which make it easier to define complex grammars, but at the expense of clarity.

Ø  They use a syntax based on regular expression syntax, inherited from SGML, to describe the schema. Typical XML APIs such as SAX do not attempt to offer applications a structured representation of the syntax, so it is less accessible to programmers than an element-based syntax may be.

Characteristics of XML:-

1.    Extensible:- XML allows you to create your own self-descriptive tags, or language, to suit your application.

2.    Carries the data:- XML allows you to store data irrespective of how it will be presented.

3.    Public standard:- XML was developed by an organization called the World Wide Web Consortium (W3C) and is available as an open standard.

XML Usage:-

1.    The basic use of XML is distributing data over the internet.

2.    XML can be used for offloading and reloading of databases.

3.    XML can be used to exchange information between organizations and systems.

4.    XML can be used to customize data handling needs.

5.    Virtually any type of data can be expressed as an XML document.

6.    XML can be used to simplify the creation of HTML documents for large web sites.


Not allowed character    Replacement entity    Description

<                        &lt;                  Less Than
>                        &gt;                  Greater Than
&                        &amp;                 Ampersand
'                        &apos;                Apostrophe
"                        &quot;                Quotation Mark
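
Replacement entities of this kind are usually applied by a library rather than by hand. A minimal sketch, assuming Python's standard xml.sax.saxutils module: escape always handles the ampersand, less-than and greater-than characters, and the extra mapping passed here adds the two quote characters. The sample string is made up for the example.

```python
from xml.sax.saxutils import escape, unescape

raw = 'if a < b && name == "x"'
# escape() always replaces &, < and >; quotes are optional extras.
safe = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(safe)

# Reverse the mapping to recover the original text.
restored = unescape(safe, {"&quot;": '"', "&apos;": "'"})
print(restored == raw)
```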


            <?xml version="1.0"?>

            <!-- The root element name "contact" is illustrative. -->

            <contact>

                        <address> add1 </address>

                        <phone> 012011 </phone>

            </contact>


An XML document can have only one root element.
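
The single-root rule is enforced by the parser. The sketch below assumes Python's standard xml.etree.ElementTree; the root element name "contact" is hypothetical, chosen only to wrap the address and phone elements.

```python
import xml.etree.ElementTree as ET

one_root = "<contact><address>add1</address><phone>012011</phone></contact>"
two_roots = "<address>add1</address><phone>012011</phone>"

print(ET.fromstring(one_root).tag)

try:
    ET.fromstring(two_roots)
    accepted = True
except ET.ParseError:
    # Content after the first top-level element is reported as an error.
    accepted = False
print(accepted)
```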

Semi-Structured Data:-

Semi-structured data is data that is neither raw data nor typed data in a conventional database system. It is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as a self-describing structure.
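
A short sketch may make "self-describing" more concrete. It uses plain Python dictionaries as a stand-in for tagged records; the names and values are invented for the example.

```python
# Two hypothetical records of the same kind; neither follows a fixed schema.
records = [
    {"name": "Asha", "phone": "012011"},
    {"name": "Ravi", "email": "ravi@example.com", "address": {"city": "Pune"}},
]

for record in records:
    # The keys travel with the data, so each record describes itself.
    print(sorted(record))

# A relational table would need one column per field seen in any record.
all_fields = sorted({key for record in records for key in record})
print(all_fields)
```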

The problem is that some data does not lend itself to rigid structuring, and such data is becoming more and more common. A great deal of data relevant to the enterprise is turning up in documents, images and emails as well as tweets and other social media data. All of these can be described as semi-structured data.

Options for managing semi-structured data are:-

a)     Ignore it (probably fatal in a competitive climate) :-

This is a good one to rule out. So much data is being created and collected in semi-structured forms that most enterprises cannot afford to disregard the outpouring of it. Ignoring it is viable only if there is no compelling business advantage in being able to track and analyse such data.

b)     Force it into structured relational form :-

Relational database engines have been significantly modified over the years to handle what are characterised by the database manufacturers as "complex data types".

XML is one example: It is considered by many to be an excellent way of holding classic semi-structured data. Most common document formats are, or can be, rendered into XML, and almost all relational engines now have an XML data type, which means that documents often can be stored in a relational database. But the additional complexity of handling semi-structured data means there will inevitably be a trade-off, and in general that will equate to slower retrieval times. However, it does make it very easy to find all tweets that refer to your product, all emails that mention "politician" and so on.

Other examples of complex data types are those that can handle spatial and image data. 
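
As a rough illustration of option b), the sketch below keeps XML documents in a relational table and searches them. It assumes Python's standard sqlite3 module; SQLite has no XML data type, so a TEXT column stands in for one and the filtering is done in Python rather than inside the engine. The table, column and element names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# SQLite has no XML type, so a TEXT column stands in for one here.
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body_xml TEXT)")
conn.executemany(
    "INSERT INTO tweets (body_xml) VALUES (?)",
    [
        ("<tweet><user>a</user><text>Loving my new Widget</text></tweet>",),
        ("<tweet><user>b</user><text>Rainy day again</text></tweet>",),
    ],
)

# Pull each document back out and inspect its <text> element.
matches = [
    row_id
    for row_id, body in conn.execute("SELECT id, body_xml FROM tweets ORDER BY id")
    if "widget" in ET.fromstring(body).findtext("text").lower()
]
print(matches)
```

Every document is parsed on every search, which hints at the retrieval-time trade-off mentioned above.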

c)      Adopt a different storage mechanism:-

There is increasing interest in adopting alternative data management and storage mechanisms. Imagine you store patient X-rays as images. We store data so we can retrieve it later and also so we can query it, but running a query against an X-ray image is a somewhat bizarre concept because the X-ray is simply a collection of pixels. What often happens in practice is that this and other semi-structured data comes with some attached metadata and can also undergo some form of analysis in order to generate further metadata. (In a nutshell, metadata is data about data). In the case of an email, the attached metadata might include length, sender, recipient, time/date and so on. Automatic semantic analysis of the email could be performed and that might yield metadata about the tone of the email (as in, angry, conciliatory, praising, etc.), its grammatical construction (correct, lax, etc.) and so on.

Metadata is typically highly structured and is therefore highly susceptible to analysis. So we could then store the emails and the metadata in a relational database and query the metadata to find, not just those emails that mention your product, but more specifically those that are well-written and also positive about the product.
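
The idea of querying the metadata instead of the content can be sketched as follows. This assumes Python's standard sqlite3 module; the table layout, the tone and grammar labels, and the sample emails are all invented for the example, and the metadata is entered by hand rather than produced by real semantic analysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails ("
    "id INTEGER PRIMARY KEY, sender TEXT, body TEXT, "
    "tone TEXT, grammar TEXT, mentions_product INTEGER)"
)
conn.executemany(
    "INSERT INTO emails (sender, body, tone, grammar, mentions_product) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("ann@example.com", "The Widget is excellent.", "praising", "correct", 1),
        ("bob@example.com", "widget broke!!! fix it", "angry", "lax", 1),
        ("cat@example.com", "Lunch on Friday?", "conciliatory", "correct", 0),
    ],
)

# The query touches only the structured metadata, never the free-text body.
rows = conn.execute(
    "SELECT sender FROM emails "
    "WHERE mentions_product = 1 AND tone = 'praising' AND grammar = 'correct'"
).fetchall()
print(rows)
```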

Types of semi-structured data:- 

1.    XML :- (eXtensible Markup Language)

It is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It was designed to store and transport data.

The basic use of XML is distributing data over the internet.

2.    JSON :- (JavaScript Object Notation)

It is an open standard format that uses human-readable text to transmit data objects consisting of attribute-value pairs. It is used primarily to transmit data between a server and web application, as an alternative to XML.

            There is a new breed of databases such as MongoDB and Couchbase that store data natively in JSON format.
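
A small sketch of JSON's attribute-value pairs, assuming Python's standard json module; the record itself is made up for the example.

```python
import json

text = '{"name": "Asha", "phones": ["012011", "012012"], "address": {"city": "Pune"}}'
record = json.loads(text)  # JSON text -> Python dict
print(record["phones"][0], record["address"]["city"])

# Serialising again gives text that can be sent to a server or web application.
round_trip = json.dumps(record, sort_keys=True)
print(round_trip)
```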

Advantages of Semi-Structured data :-

1.    No need to worry about object-relational impedance mismatch.

2.    Support for nested or hierarchical data often simplifies data models that represent complex relationships between entities.

3.    Support for lists of objects simplifies data models by avoiding messy translations of lists into a relational data model.
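
The translation of lists into a relational model can be sketched in a few lines. The order document and the two table layouts below are hypothetical, and plain Python tuples stand in for table rows.

```python
# A hypothetical order document with a nested list of line items.
order = {
    "order_id": 7,
    "customer": {"name": "Asha"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B4", "qty": 1}],
}

# The relational translation needs a second table plus a foreign key.
orders_table = [(order["order_id"], order["customer"]["name"])]
order_items_table = [
    (order["order_id"], item["sku"], item["qty"]) for item in order["items"]
]
print(orders_table)
print(order_items_table)
```

In the semi-structured form the list stays inside the document, so no second table or join is needed to read an order back.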


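Disadvantages of Semi-Structured data :-

1.    The traditional relational data model has a popular, ready-made query language, SQL; semi-structured data has no equally established counterpart.

2.    It is prone to "garbage in, garbage out": with fewer constraints in the data model, less forethought is necessary to operate a data application, so poor-quality data is accepted more easily.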

