Obtaining Information from Events and Results: The Limitations (a draft)

The purpose of information technology is to capture the result of an event. The result is represented or embodied, in general, in a transaction or a report. For heuristic purposes here, I would include business intelligence, semantic data, and sensor data; again, this is only a heuristic statement. The data may be minute, “big,” or meaningful semantics and graphs. Even when data sources are vast, submitted to massively parallel processing, and analyzed with new statistical procedures, the result may be a trivial report such as the sheer raw number of “tweets.” Users, databases, other machines, and sensors are, even for minute events, both producers and consumers of results. A result belongs to data curators, data stewards, DBAs, developers, and testers.
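As a minimal sketch (the class, field names, and values below are illustrative, not drawn from any particular system), the captured result of an event can be pictured as a small record that names its producer and carries the values it embodies:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class EventResult:
    """Illustrative record of a captured event result (a transaction line or report row)."""
    event_id: str                  # identifier assigned at capture time
    source: str                    # producer: a user, database, machine, or sensor
    payload: dict                  # the measured or transacted values
    captured_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    steward: Optional[str] = None  # curator or steward responsible for the result

# Even a minute event is captured as a result that some later consumer will read.
reading = EventResult(event_id="t-001", source="sensor-17", payload={"temp_c": 21.4})
print(reading.event_id, reading.payload)
```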

An economics of information can be seen as the effort and expense required to capture the result of an event. The economic costs versus benefits of information are measured by the significance, meaning, and value of results. The economics of data, seen simply, compares the cost of capture against the probability that the benefits exceed that cost. Good or bad, true or false, representations of results can come from systems of any size. The designation of “data at rest” or “data in motion” is a distinction without a difference. Any transactional result is an instantaneous report, and a report is a persistent, though not necessarily permanent, representation of a transaction. At any instant, data in motion must rest in order to be converted into new data or information, and data at rest must move in order to capture history, become master data, or be archived.
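Read as a back-of-the-envelope illustration rather than a formal model (the function, costs, and probabilities are invented for the example), the comparison amounts to asking whether the expected benefit of capturing a result exceeds the cost of capturing it:

```python
def worth_capturing(capture_cost: float, benefit_if_useful: float, p_useful: float) -> bool:
    """Crude economics of a single result: capture it only if the expected benefit
    (value of the result times the probability it proves useful) exceeds the cost
    of capturing and keeping it."""
    expected_benefit = p_useful * benefit_if_useful
    return expected_benefit > capture_cost

# A cheap sensor reading with even a small chance of mattering is worth keeping;
# an expensive report that will likely never be read is not.
print(worth_capturing(capture_cost=0.01, benefit_if_useful=5.0, p_useful=0.02))    # True
print(worth_capturing(capture_cost=100.0, benefit_if_useful=500.0, p_useful=0.1))  # False
```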

Data integration is a means of separating the fat from the lean of information. Too often, typical enterprise architecture stack diagrams or matrices portray “data” as sitting between business intelligence and applications, as in the Federal Enterprise Architecture Framework (FEAF). The FEAF model reduces data architecture to a storage and management function of applications: the “data” element is merely supported by applications and technology. In contrast, Zachman’s framework gives “Data” a cross-cutting importance through all layers; some versions of the Zachman diagram name this first column “What,” while others label it “Data.” However, no application is worth more than the result of the data it captures. The foundation of systems should be seen in terms of their function, not in terms of a popular sensibility that looks for a technology or infrastructure foundation.

It is important how “data” is depicted in any “stack” diagram. No matter how the rest of the application and infrastructure layers are stood up or configured, referential integrity and semantic continuity are essential. The representation of where “data” sits, or in what “swim lane” it appears, conveys meaning.

Typical IT Stack Diagram

BI / Reporting / GIS
Data
Applications
Infrastructure

The role of data is minimized in this representation: it is depicted as supported by the infrastructure, not as a pervasive, cross-cutting requirement. Furthermore, the fundamental ground of data is the “semantic layer,” and there is no semantic “layer” that sits in a swim lane by itself. Even such a robust software development book as “Domain-Driven Design” emphasizes the need to ensure understanding of the semantic content of data.

When “glossary” or “vocabulary” words are used in an attempt to identify data semantics, neither is a complete, comprehensive, or enterprise-wide approach. A glossary may cover only the words in a single system, API, group of applications, or other non-enterprise development, and it may have little or no relationship to foundational meanings. Even “semantics” can be assumed to be equivalent to a glossary. These views of meaning reflect a strictly as-is, bottom-up approach to capturing the concepts that comprise physical data models. A to-be, top-down approach, by contrast, starts with a canonical model.
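One way to picture the difference, using invented system and field names, is that a glossary defines words locally within each system, while a canonical model defines a single enterprise concept to which each system’s physical names are mapped:

```python
# Hypothetical example: two systems describe the same concept with different local words.
system_glossaries = {
    "billing_app": {"cust_no": "Number assigned to a paying account"},
    "crm_app":     {"client_id": "Identifier of a sales contact"},
}

# A canonical model starts top-down with one enterprise concept...
canonical_model = {
    "Customer.identifier": "Unique identifier of a party that purchases goods or services"
}

# ...and the physical names are mapped to it, rather than each glossary standing alone.
mappings = {
    ("billing_app", "cust_no"): "Customer.identifier",
    ("crm_app", "client_id"):   "Customer.identifier",
}

print(mappings[("billing_app", "cust_no")])  # Customer.identifier
```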

A canonical environment and its semantic derivations are the foundations of continuity from data collection to analytics. Building an ontology (or attempting to automate the discovery of one), using natural language processing to see into data and check its validity, and organizing data are likewise foundational. Data that is not collected in the first place cannot be analyzed, and data that is not collected consistently is probably worthless. The data collected is the result of an event, however transient or persistent. The trajectory of data collection is analysis.
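A minimal sketch of that continuity, assuming an invented canonical schema and invented records, is to check collected results against the canonical definitions before they enter analysis, so that inconsistently collected data is caught rather than analyzed:

```python
# Hypothetical canonical schema: field name -> required Python type.
CANONICAL_SCHEMA = {"event_id": str, "source": str, "temp_c": float}

def conforms(record: dict) -> bool:
    """A record enters analysis only if it carries every canonical field with the expected type."""
    return all(isinstance(record.get(name), typ) for name, typ in CANONICAL_SCHEMA.items())

collected = [
    {"event_id": "t-001", "source": "sensor-17", "temp_c": 21.4},
    {"event_id": "t-002", "source": "sensor-17"},   # inconsistently collected
]

analyzable = [r for r in collected if conforms(r)]
print(len(analyzable))  # 1 -- only consistently collected results reach analytics
```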

Nevertheless, there are three major contradictions in the organization of “analytics.”

  1. Creating and maintaining a “controlled vocabulary” and semantic continuity is possible, but doing so may not keep up with the changes users need in order to gain analytical insight.
  2. Making faster, more flexible self-service BI applications may be desirable, but doing so may come at the cost of data quality.
  3. Relying solely on a client’s statement of a data problem may be “business” oriented, but it may miss insights into the actual substance of the problem at hand. This is not an IT problem; it is not a problem of too much or too little data. It is a problem of knowing the subject at hand (medicine, health care, customer demographics, geography, housing finance, agribusiness, civil engineering, urban design, linguistics, logic, and all the rest).

The purpose of information technology is not software development for its own sake. Yet definitions of information technology are often little more than a list of the means of creating systems and data, with the emphasis on the technology and not on the information.