DATA STORAGE
Data gives rise to information that in turn gives rise to knowledge. Knowledge leads to understanding.
Understanding leads to wisdom.
Data may be univariate if it has only one variable. It may be bivariate if it has two variables allowing
correlation. It may be multivariate with several variables allowing more sophisticated analyses.
A document is stored data in any form: paper, book, letter, message, image, e-mail, voice, and sound.
Some documents are ephemeral but can still be retrieved for the brief time that they exist and are recoverable.
Data is physically stored as bytes. A byte has 8 bits and can therefore represent 2^8 = 256
characters.
ASCII is a character encoding that defines 128 codes (95 printable character codes and 33 control codes). ANSI
is the name Microsoft uses for 8-bit extensions of ASCII. Different languages use different numbers of codes; for example, Greek uses 219
characters, Cyrillic uses 259 characters, Arabic uses 196 characters, and Chinese uses 65,536 characters.
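The byte arithmetic and character codes above can be checked with a short Python sketch. The use of UTF-8 here is an assumption made for illustration (the notes do not name a specific multi-byte encoding), and the helper name `encoded_length` is hypothetical.

```python
# Number of distinct values one 8-bit byte can hold.
BITS_PER_BYTE = 8
values_per_byte = 2 ** BITS_PER_BYTE  # 256

# ord() returns the numeric code of a character; "A" is code 65 in ASCII.
ascii_code_for_a = ord("A")


def encoded_length(text: str, encoding: str = "utf-8") -> int:
    """Return how many bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


# Assumption: UTF-8, where characters outside ASCII need more than one byte.
latin_bytes = encoded_length("A")     # 1 byte
greek_bytes = encoded_length("α")     # 2 bytes
chinese_bytes = encoded_length("中")   # 3 bytes

print(values_per_byte, ascii_code_for_a, latin_bytes, greek_bytes, chinese_bytes)
```

Scripts with more characters than one byte can distinguish need more than one byte per character, which is why storage size depends on the encoding chosen.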
Data compression facilitates data storage and data retrieval. Data compression makes document retrieval
easier because the search is carried out in a smaller space. Character, image, and sound data can all be compressed; however,
some compression methods are lossy, meaning that part of the data is discarded and cannot be recovered.
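As a minimal sketch of lossless compression, the example below uses Python's standard `zlib` module on a repetitive text sample. The sample text is invented for illustration; lossy methods (common for images and sound) are not shown.

```python
import zlib

# A repetitive, made-up sample; repetition is what makes text compress well.
text = ("Patient temperature normal. " * 50).encode("ascii")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

# Lossless: the restored bytes are identical to the original.
print(len(text), len(compressed), restored == text)
```

Because the round trip restores every byte, this kind of compression suits character data, where losing even one character would change the document.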
Data may be formatted, as in the tables and records of several types of databases (relational, hierarchical, and network).
It may be unformatted, such as images, sound, or electronic monitoring output in a hospital. Formatted documents are easier to retrieve.
Files may be described as sequential, indexed, tree-structured, or clustered.
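The difference between two of these file organizations, sequential and indexed, can be sketched in Python. The records and function names below are hypothetical, and in-memory lists stand in for files on disk.

```python
# Hypothetical records, stored in arrival order.
records = [
    {"id": 103, "name": "Amina"},
    {"id": 101, "name": "Bashir"},
    {"id": 102, "name": "Chen"},
]


def sequential_find(records, record_id):
    """Sequential file: scan records in stored order until a match is found."""
    for record in records:
        if record["id"] == record_id:
            return record
    return None


# Indexed file: a separate index maps each key to the record's position.
index = {record["id"]: pos for pos, record in enumerate(records)}


def indexed_find(records, index, record_id):
    """Look up the position in the index, then read that one record."""
    pos = index.get(record_id)
    return records[pos] if pos is not None else None


print(sequential_find(records, 102) == indexed_find(records, index, 102))
```

The sequential scan may touch every record, while the indexed lookup goes straight to the stored position, at the cost of maintaining the index.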
MEDLINE and PDQ are examples of medical databases. MEDLINE was established in 1971. Every year 400,000
articles from 3,700 journals are added and indexed using Medical Subject Headings (MeSH). GRATEFUL MED is a search program
used to query MEDLINE. PDQ (Physician Data Query) is a database about cancer.
DATA RETRIEVAL
Document surrogates used in data retrieval are: identifiers, abstracts, extracts, reviews, indexes,
and queries.
Queries are short documents used to retrieve larger documents by matching, mapping, or use of Boolean
logic (AND, OR, NOT). Queries may be in natural or probabilistic language. Fuzzy queries are deliberately not rigid, to increase
the probability of retrieval.
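Boolean retrieval can be sketched with Python sets: each term maps to the set of documents containing it, and AND, OR, and exclusion (NOT, sometimes phrased "but not") become set intersection, union, and difference. The three documents and the helper names are invented for illustration.

```python
# Hypothetical document collection: id -> text.
documents = {
    1: "cancer treatment with chemotherapy",
    2: "cancer screening guidelines",
    3: "diabetes treatment guidelines",
}


def build_index(documents):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


index = build_index(documents)


def postings(term):
    """Return the ids of documents containing `term` (empty set if none)."""
    return index.get(term, set())


cancer_and_treatment = postings("cancer") & postings("treatment")   # {1}
cancer_or_diabetes = postings("cancer") | postings("diabetes")      # {1, 2, 3}
treatment_not_cancer = postings("treatment") - postings("cancer")   # {3}

print(cancer_and_treatment, cancer_or_diabetes, treatment_not_cancer)
```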
Other forms of data retrieval are term extraction (based on low frequency of important terms), term
association (based on terms that normally occur together), lexical measures (using specialized formulas), trigger phrases
(like figure, table, conclusion), synonyms (same meaning), antonyms (opposite meaning), homographs (same spelling but different
meaning), and homophones (same sound but different spelling). Stemming algorithms help in retrieval by removing word endings,
leaving only the roots. Specialized mathematical techniques are used to assess the effectiveness of data retrieval.
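Two of the ideas above can be sketched in Python. The stemmer is a deliberately naive suffix stripper with an invented suffix list, not a published algorithm such as Porter's. Precision and recall are used here as one common pair of effectiveness measures; the notes do not say which measures they have in mind, so that choice is an assumption.

```python
# Assumed, deliberately short suffix list; longer suffixes are tried first.
SUFFIXES = ("ations", "ation", "ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def precision_recall(retrieved: set, relevant: set):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


stems = {naive_stem(w) for w in ("retrieving", "retrieved", "retrieves")}
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(stems, precision, recall)
```

All three word forms reduce to the same stem, so a query using one form can match documents using another. In the second call, two of the four retrieved items are relevant (precision 0.5) and two of the three relevant items were found (recall of about 0.67).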
DATA WAREHOUSING
Data warehousing is a method of extracting data from various sources and storing it as historical,
integrated data for use in decision-support systems. Metadata is the term for the definitions of the data stored in the data warehouse (i.e., data
about data). A data model is a graphic representation of the data, either as diagrams or charts. The data model reflects the
essential features of an organization. The purpose of a data model is to facilitate communication between the analyst and
the user. It also helps create a logical discipline in database design.
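The extraction, integration, and metadata ideas can be illustrated with a toy Python sketch. The two source systems, their field names, and the function `load_snapshot` are all hypothetical; a real warehouse would use a database and dedicated loading tools.

```python
from datetime import date

# Two hypothetical source systems that name the same patient differently.
clinic_rows = [{"patient": "P1", "visits": 3}]
lab_rows = [{"pid": "P1", "tests": 5}]

# Metadata: data about the data held in the warehouse.
metadata = {
    "patient_id": "Patient identifier, unified across sources",
    "visits": "Number of clinic visits (source: clinic system)",
    "tests": "Number of laboratory tests (source: lab system)",
    "snapshot_date": "Date the row was loaded; preserves history",
}


def load_snapshot(clinic_rows, lab_rows, snapshot_date):
    """Extract rows from both sources and integrate them under one key."""
    merged = {}
    for row in clinic_rows:
        merged.setdefault(row["patient"], {})["visits"] = row["visits"]
    for row in lab_rows:
        merged.setdefault(row["pid"], {})["tests"] = row["tests"]
    return [
        {
            "patient_id": pid,
            "visits": values.get("visits", 0),
            "tests": values.get("tests", 0),
            "snapshot_date": snapshot_date,
        }
        for pid, values in sorted(merged.items())
    ]


# Each load is appended rather than overwritten, so history accumulates.
warehouse = []
warehouse += load_snapshot(clinic_rows, lab_rows, date(2024, 1, 1))
print(warehouse)
```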
DATA MINING
Data mining is the discovery part of knowledge discovery in databases (KDD), involving
knowledge engineering, classification, and problem solving. KDD starts with selection, cleaning, enrichment, and coding. The
products of data mining are recognized patterns. These patterns are then applied to new situations in predicting and profiling.
Artificial intelligence (AI), based on machine learning, imbues computers with some creativity and decision-making capabilities
using specific algorithms.
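A toy Python sketch of the cleaning, coding, discovery, and prediction steps follows. The records, the age threshold of 50, and the function names are invented for illustration, and the "pattern" is simply the most common outcome per coded profile, far simpler than real data mining algorithms.

```python
from collections import Counter, defaultdict

# Hypothetical selected records; the last one is incomplete.
raw = [
    {"age": 67, "smoker": "yes", "outcome": "high risk"},
    {"age": 70, "smoker": "yes", "outcome": "high risk"},
    {"age": 30, "smoker": "no", "outcome": "low risk"},
    {"age": 25, "smoker": "no", "outcome": "low risk"},
    {"age": None, "smoker": "yes", "outcome": "high risk"},
]

# Cleaning: drop records with missing values.
clean = [r for r in raw if all(v is not None for v in r.values())]


def code(record):
    """Coding: turn raw values into a categorical profile (assumed threshold)."""
    return ("older" if record["age"] >= 50 else "younger", record["smoker"])


# Discovery: for each coded profile, keep the most common outcome.
outcomes = defaultdict(Counter)
for r in clean:
    outcomes[code(r)][r["outcome"]] += 1
patterns = {
    profile: counts.most_common(1)[0][0] for profile, counts in outcomes.items()
}


def predict(record, default="unknown"):
    """Prediction: apply the discovered patterns to a new case."""
    return patterns.get(code(record), default)


print(predict({"age": 72, "smoker": "yes"}))
```

A profile never seen during discovery (for example an older non-smoker in this sample) returns the default, which shows that discovered patterns only generalize as far as the data they came from.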
DATA REPLICATION
Data replication is a copy management service that involves copying
the data and also managing the copies. It ensures that all parts of the organization have access to updated data. It is also
an insurance against data loss in case of computer crashes because there will be an alternative data source. Databases must
be designed and configured to facilitate replication. The replication infrastructure must be in place from the start. Care
must be taken to make sure that replicated data is consistent and in synchrony with the master copy. The process of replication
may inadvertently create redundancy in the system. In synchronous data replication there is no latency in data consistency.
All replicas of the data are the same because of immediate updating. In asynchronous data replication the updating is not
immediate and consistency is loose, so replicas may briefly lag behind the master copy. Asynchronous replication is easier and cheaper.
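The contrast between synchronous and asynchronous replication can be sketched in Python. The classes below are hypothetical in-memory stand-ins (no network, failures, or conflict handling), meant only to show when replicas receive an update.

```python
class Replica:
    """A copy of the data held somewhere else in the organization."""

    def __init__(self):
        self.data = {}


class SyncMaster:
    """Synchronous: a write returns only after every replica is updated."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        for replica in self.replicas:
            replica.data[key] = value


class AsyncMaster:
    """Asynchronous: writes are queued and applied to replicas later."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = []

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def flush(self):
        """Apply queued writes to every replica, in order."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


sync_replica = Replica()
SyncMaster([sync_replica]).write("bed_12", "occupied")

async_replica = Replica()
async_master = AsyncMaster([async_replica])
async_master.write("bed_12", "occupied")
stale = "bed_12" not in async_replica.data  # replica lags until flush()
async_master.flush()

print(sync_replica.data, stale, async_replica.data == async_master.data)
```

The synchronous master pays for consistency by doing all replica updates inside every write, while the asynchronous master returns immediately and accepts a window in which replicas are out of date.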