Data Management/Data Warehousing Definitions

This glossary explains the meaning of key words and phrases that information technology (IT) and business professionals use when discussing data management and related software products. You can find additional definitions by visiting WhatIs.com.

#
5 V's of big data

The 5 V's of big data are the five main and innate characteristics of big data.
A
ACID (atomicity, consistency, isolation, and durability)

ACID (atomicity, consistency, isolation, and durability) is an acronym and mnemonic device for learning and remembering the four primary attributes ensured to any transaction by a transaction manager (which is also called a transaction monitor).
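
As an illustration of atomicity, the sketch below uses Python's built-in sqlite3 module. The accounts table, the transfer helper and the sample balances are hypothetical and not part of the definition above. The transfer credits one account and debits another inside a single transaction, so a failed debit also undoes the credit.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Fails the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # debit fails, credit is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After both calls, only the first transfer is reflected in the balances; the second leaves no partial update behind.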
ActiveX Data Objects (ADO)

ActiveX Data Objects (ADO) is an application program interface from Microsoft that lets a programmer writing Windows applications get access to a relational or non-relational database from both Microsoft and other database providers.
AdventureWorks Database

AdventureWorks Database is a sample OLTP database that Microsoft ships with all of its SQL Server database products.
Apache Flink

Apache Flink is a distributed data processing platform for use in big data applications, primarily involving analysis of data stored in Hadoop clusters.

Apache Giraph

Apache Giraph is real-time graph processing software that is mostly used to analyze social media data. Giraph was developed by Yahoo! and given to the Apache Software Foundation for future management.
Apache Hadoop YARN

Apache Hadoop YARN is the resource management and job scheduling technology in the open source Hadoop distributed processing framework.
Apache HBase

Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS).
Apache Hive

Apache Hive is an open source data warehouse system for querying and analyzing large data sets that are principally stored in Hadoop files.
Apache Pig

Apache Pig is an open-source technology that offers a high-level mechanism for parallel programming of MapReduce jobs to be executed on Hadoop clusters.
Apache Spark

Apache Spark is an open source parallel processing framework for running large-scale data analytics applications across clustered computers.
Azure Data Studio (formerly SQL Operations Studio)

Azure Data Studio is a Microsoft tool, originally named SQL Operations Studio, for managing SQL Server databases and cloud-based Azure SQL Database and Azure SQL Data Warehouse systems.
Azure SQL Data Warehouse

Azure SQL Data Warehouse is a managed data warehouse as a service (DWaaS) offering provided by Microsoft Azure.
B
big data

Big data is a combination of structured, semistructured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications.
big data engineer

A big data engineer is an information technology (IT) professional who is responsible for designing, building, testing and maintaining complex data processing systems that work with large data sets.
big data management

Big data management is the organization, administration and governance of large volumes of both structured and unstructured data.
C
C++

C++ is an object-oriented programming (OOP) language that is viewed by many as the best language for creating large-scale applications.
column database management system (CDBMS)

There are different types of CDBMS offerings, with the common defining feature being that data is stored by column (or column families) instead of as rows.
columnar database

A columnar database is a database management system (DBMS) that stores data in columns instead of rows.
compliance

Compliance is the state of being in accordance with established guidelines or specifications, or the process of becoming so.
conformed dimension

In data warehousing, a conformed dimension is a dimension that has the same meaning to every fact with which it relates.
consumer privacy (customer privacy)

Consumer privacy, also known as customer privacy, involves the handling and protection of the sensitive personal information provided by customers in the course of everyday transactions.
cooked data

Cooked data is raw data after it has been processed - that is, extracted, organized, and perhaps analyzed and presented - for further use.
corporate performance management (CPM)

Corporate performance management (CPM) is a term used to describe the various processes and methodologies involved in aligning an organization's strategies and goals to its plans and executions in order to control the success of the company.
CouchDB

CouchDB is an open source, document-oriented NoSQL database based on common web standards. NoSQL databases are useful for very large sets of distributed data, especially the large amounts of non-uniform data in various formats that are characteristic of web-based data.
CRUD cycle (Create, Read, Update and Delete Cycle)

The CRUD cycle describes the elemental functions of a persistent database in a computer.
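
A minimal sketch of the four CRUD operations, using Python's built-in sqlite3 module. The customers table and its values are hypothetical examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# Create: insert a new record and remember its generated key.
cur = conn.execute("INSERT INTO customers (name) VALUES (?)", ("Ada",))
customer_id = cur.lastrowid

# Read: fetch the record back by key.
created = conn.execute(
    "SELECT name FROM customers WHERE id = ?", (customer_id,)
).fetchone()

# Update: change a field on the existing record.
conn.execute("UPDATE customers SET name = ? WHERE id = ?", ("Ada L.", customer_id))
updated = conn.execute(
    "SELECT name FROM customers WHERE id = ?", (customer_id,)
).fetchone()

# Delete: remove the record.
conn.execute("DELETE FROM customers WHERE id = ?", (customer_id,))
remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```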
customer data integration (CDI)

Customer data integration (CDI) is the process of defining, consolidating and managing customer information across an organization's business units and systems to achieve a "single version of the truth" for customer data.
D
dark data

Dark data is digital information an organization collects, processes and stores that is not currently being used for business purposes.
data

In computing, data is information that has been translated into a form that is efficient for movement or processing.
data access rights

A data access right (DAR) is a permission that has been granted that allows a person or computer program to locate and read digital information at rest. Digital access rights play an important role in information security and compliance.
data activation

Data activation is a marketing approach that uses consumer information and data analytics to help companies gain real-time insight into target audience behavior and plan for future marketing initiatives.
data aggregation

Data aggregation is any process whereby data is gathered and expressed in a summary form.
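
For instance, detail rows can be rolled up into per-group counts and totals. The following Python sketch assumes a hypothetical list of (region, amount) sales rows.

```python
from collections import defaultdict

def aggregate(rows):
    """Summarize (region, amount) rows as a count and total per region."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for region, amount in rows:
        totals[region]["count"] += 1
        totals[region]["total"] += amount
    return dict(totals)

# Hypothetical detail-level data.
sales = [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 20.0)]
summary = aggregate(sales)
```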
data analytics (DA)

Data analytics (DA) is the process of examining data sets in order to find trends and draw conclusions about the information they contain.
data architect

A data architect is an IT professional responsible for defining the policies, procedures, models and technologies to be used in collecting, organizing, storing and accessing company information.
Data as a Service (DaaS)

Data as a Service (DaaS) is an information provision and distribution model in which data files (including text, images, sounds, and videos) are made available to customers over a network, typically the Internet.
data catalog

A data catalog is a software application that creates an inventory of an organization's data assets to help data professionals and business users find relevant data for analytics uses.
data classification

Data classification is the process of organizing data into categories that make it easy to retrieve, sort and store for future use.
data cleansing (data cleaning, data scrubbing)

Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing incorrect, incomplete, duplicate or otherwise erroneous data in a data set.
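
A small Python sketch of what a cleansing step might look like. The record layout (name and email fields) and the rules applied are assumptions chosen for illustration; real cleansing rules depend on the data set.

```python
def cleanse(records):
    """Normalize fields, drop incomplete rows and remove duplicates by email."""
    seen = set()
    cleaned = []
    for rec in records:
        name = (rec.get("name") or "").strip()
        email = (rec.get("email") or "").strip().lower()
        if not name or "@" not in email:
            continue  # incomplete or clearly invalid record
        if email in seen:
            continue  # duplicate of a record already kept
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned

# Hypothetical raw input with stray whitespace, a duplicate and missing values.
raw = [
    {"name": " Ada ", "email": "ADA@example.com"},
    {"name": "Ada", "email": "ada@example.com "},
    {"name": "", "email": "nobody@example.com"},
    {"name": "Bob", "email": None},
]
cleaned = cleanse(raw)
```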
data dredging (data fishing)

Data dredging -- sometimes referred to as data fishing -- is a data mining practice in which large data volumes are analyzed to find any possible relationships between them.
data engineer

A data engineer is an IT worker whose primary job is to prepare data for analytical or operational uses.
data fabric

A data fabric is an architecture and software offering a unified collection of data assets, databases and database architectures within an enterprise.
data federation software

Data federation software is programming that provides an organization with the ability to collect data from disparate sources and aggregate it in a virtual database where it can be used for business intelligence (BI) or other analysis.
data flow diagram (DFD)

A data flow diagram (DFD) is a graphical or visual representation using a standardized set of symbols and notations to describe a business's operations through data movement.
data integration

Data integration is the process of combining data from multiple source systems to create unified sets of information for both operational and analytical uses.
data lake

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed for analytics applications.
data lakehouse

A data lakehouse is a data management architecture that combines the benefits of a traditional data warehouse and a data lake.
data management as a service (DMaaS)

Data management as a service (DMaaS) is a type of cloud service that provides enterprises with centralized storage for disparate data sources.
data mart (datamart)

A data mart is a repository of data that is designed to serve a particular community of knowledge workers.
data modeling

Data modeling is the process of creating a simplified diagram of a software system and the data elements it contains, using text and symbols to represent the data and how it flows.
data pipeline

A data pipeline is a system that moves data from one (source) location to another (target) location, much like how an oil pipeline moves oil from one location to another.
data preprocessing

Data preprocessing, a component of data preparation, describes any type of processing performed on raw data to prepare it for another data processing procedure.
data profiling

Data profiling refers to the process of examining, analyzing, reviewing and summarizing data sets to gain insight into the quality of data.
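
As a rough sketch, a profile of a single column might report row counts, missing values, distinct values and the value range. The profile function and the sample rows below are hypothetical.

```python
def profile(rows, column):
    """Return simple quality statistics for one column of dict-shaped rows."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

# Hypothetical data set with one explicit null and one absent field.
rows = [{"age": 31}, {"age": None}, {"age": 45}, {"age": 31}, {}]
age_profile = profile(rows, "age")
```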
data quality

Data quality is a measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability and whether it's up to date.
data silo

A data silo exists when an organization's departments and systems cannot, or do not, communicate freely with one another and encourage the sharing of business-relevant data.
data stewardship

Data stewardship is the management and oversight of an organization's data assets to help provide business users with high-quality data that is easily accessible in a consistent manner.
data structures

A data structure is a specialized format for organizing, processing, retrieving and storing data.
data transformation

Data transformation is the process of converting data from one format, such as a database file, XML document or Excel spreadsheet, into another.
data validation

Data validation is the practice of checking the integrity, accuracy and structure of data before it is used for a business operation.
data virtualization

Data virtualization is an umbrella term used to describe any approach to data management that allows an application to retrieve and manipulate data without needing to know any technical details about the data such as how it is formatted or where it is physically located.
data warehouse

A data warehouse is a federated repository for all the data collected by an enterprise's various operational systems, be they physical or logical.
data warehouse as a service (DWaaS)

Data warehouse as a service (DWaaS) is an outsourcing model in which a cloud service provider configures and manages the hardware and software resources a data warehouse requires, and the customer provides the data and pays for the managed service.
database (DB)

A database is a collection of information that is organized so that it can be easily accessed, managed and updated.
database administrator (DBA)

A database administrator (DBA) is the information technician responsible for directing or performing all activities related to maintaining a successful database environment.
database as a service (DBaaS)

Database as a service (DBaaS) is a cloud computing managed service offering that provides access to a database without requiring the setup of physical hardware, the installation of software or the need to configure the database.
database management system (DBMS)

A database management system (DBMS) is system software for creating and managing databases, allowing end users to create, protect, read, update and delete data in a database.
database normalization

Database normalization is intrinsic to most relational database schemes. It is a process that organizes data into tables so that results are always unambiguous.
database replication

Database replication is the frequent electronic copying of data from a database in one computer or server to a database in another -- so that all users share the same level of information.
database-agnostic

Database-agnostic is a term describing the capacity of software to function with any vendor’s database management system (DBMS). In information technology (IT), agnostic refers to the ability of something – such as software or hardware – to work with various systems, rather than being customized for a single system.
DataOps (data operations)

DataOps (data operations) is an Agile approach to designing, implementing and maintaining a distributed data architecture that will support a wide range of open source tools and frameworks in production. The goal of DataOps is to create business value from big data.
Db2

Db2 is a family of database management system (DBMS) products from IBM that serve a number of different operating system (OS) platforms.
denormalization

Denormalization is the process of adding precomputed redundant data to an otherwise normalized relational database to improve read performance of the database.
deterministic/probabilistic data

Deterministic and probabilistic are opposing terms that can be used to describe customer data and how it is collected. Deterministic data is also referred to as first party data. Probabilistic data is information that is based on relational patterns and the likelihood of a certain outcome.
dimension

In data warehousing, a dimension is a collection of reference information about a measurable event (fact).
dimension table

A dimension table is a table in a star schema of a data warehouse. A dimension table stores attributes, or dimensions, that describe the objects in a fact table.
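
The relationship between a dimension table and a fact table can be sketched with Python's built-in sqlite3 module. The table names (dim_product, fact_sales) and the rows are hypothetical; the query groups the facts by an attribute stored in the dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension table: descriptive attributes keyed by product_id.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    -- Fact table: one row per sale, pointing at the dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 5.0), (2, 20.0);
    """
)

# Total sales per category: facts joined to their describing dimension.
rows = conn.execute(
    """
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
```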
dirty data

In a data warehouse, dirty data is a database record that contains errors.
disambiguation

Disambiguation (also called word sense disambiguation) is the act of interpreting the intended sense or meaning of a word. Disambiguation is a common problem in computer language processing, since it is often difficult for a computer to distinguish a word’s sense when the word has multiple meanings or spellings.
What is data architecture? A data management blueprint

Data architecture is a discipline that documents an organization's data assets, maps how data flows through its systems and provides a blueprint for managing data.
What is data governance and why does it matter?

Data governance (DG) is the process of managing the availability, usability, integrity and security of the data in enterprise systems, based on internal data standards and policies that also control data usage.
What is data management and why is it important?

Data management is the process of ingesting, storing, organizing and maintaining the data created and collected by an organization, as explained in this in-depth look at the process.
E
Entity Relationship Diagram (ERD)

An entity relationship diagram (ERD), also known as an entity relationship model, is a graphical representation that depicts relationships among people, objects, places, concepts or events within an information technology (IT) system.
Extract, Load, Transform (ELT)

Extract, Load, Transform (ELT) is a data integration process for transferring raw data from a source server to a data system (such as a data warehouse or data lake) on a target server and then preparing the information for downstream uses.
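
The ordering of the steps can be sketched in Python with the built-in sqlite3 module standing in for the target system. The raw_orders and orders tables and the sample rows are hypothetical. The raw values are loaded unchanged first, and the transformation then runs inside the target.

```python
import sqlite3

# Extract: rows pulled from a source system as raw, untyped text.
source_rows = [("2024-01-05", " 19.50 "), ("2024-01-06", "5")]

target = sqlite3.connect(":memory:")

# Load: copy the raw data into a staging table without changing it.
target.execute("CREATE TABLE raw_orders (order_date TEXT, amount TEXT)")
target.executemany("INSERT INTO raw_orders VALUES (?, ?)", source_rows)

# Transform: clean and type the data inside the target system.
target.execute(
    "CREATE TABLE orders AS "
    "SELECT order_date, CAST(TRIM(amount) AS REAL) AS amount FROM raw_orders"
)
orders = target.execute(
    "SELECT order_date, amount FROM orders ORDER BY order_date"
).fetchall()
```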
F
fact table

A fact table is the central table in a star schema of a data warehouse. A fact table stores quantitative information for analysis and is often denormalized.
feature engineering

Feature engineering is the process that takes raw data and transforms it into features that can be used to create a predictive model using machine learning or statistical modeling, such as deep learning.
fetch

In computer technology, fetch has several meanings related to getting, reading, or moving data objects.
FileMaker (FMP)

FileMaker is a relational database application in which an individual may design -- and easily share on the Internet -- a database file by starting with a blank document or implementing ready-made and customizable templates.
fixed data (permanent data, reference data, archival data, or fixed-content data)

Fixed data (sometimes referred to as permanent data) is data that is not, under normal circumstances, subject to change. Any type of historical record is fixed data. For example, meteorological details for a given location on a specific day in the past are not likely to change (unless the original record is found, somehow, to be flawed).
flat file

A flat file is a collection of data stored in a two-dimensional database in which similar yet discrete strings of information are stored as records in a table.
full-text database

A full-text database is a compilation of documents or other information in the form of a database in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A full-text database is searchable by keyword, phrase, or both.
G
Google BigQuery

Google BigQuery is a cloud-based big data analytics web service for processing very large read-only data sets.
Google Bigtable

Google Bigtable is a distributed, column-oriented data store created by Google Inc. to handle very large amounts of structured data associated with the company's Internet search and Web services operations.
Google Cloud Dataflow

Google Cloud Dataflow is a cloud-based data processing service for both batch and real-time data streaming applications.
Google Cloud Spanner

Google Cloud Spanner is a distributed relational database service designed to support global online transaction processing deployments, SQL semantics, horizontal scaling and transactional consistency.
H
Hadoop

Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems.
Hadoop 2

Apache Hadoop 2 is the second iteration of the Hadoop framework for distributed data processing. Hadoop 2 adds support for running non-batch applications as well as new features to improve system availability.
Hadoop data lake

A Hadoop data lake is a data management platform comprising one or more Hadoop clusters.
Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications.
hashing

Hashing is the process of transforming any given key or a string of characters into another value.
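
One common use is mapping a key to a fixed range of buckets. The sketch below uses SHA-256 from Python's hashlib module; the bucket_for helper and the bucket count are hypothetical choices for illustration.

```python
import hashlib

def bucket_for(key, num_buckets=8):
    """Map a string key to a bucket index in the range [0, num_buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % num_buckets

# The same key always yields the same bucket.
first = bucket_for("customer-42")
second = bucket_for("customer-42")
```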
Heap

Heap is a user analytics tool which can be utilized to capture all web, mobile and cloud-based user interactions in an application.
I
in-memory database management system (IMDBMS)

An in-memory database management system (IMDBMS) stores, manages and provides access to data from main memory.
information

Information is stimuli that have meaning in some context for the receiver. When information is entered into and stored in a computer, it is generally referred to as data.