Zeljko Obrenovic's Technology Catalog

Big data analytics is the process of collecting, organizing and analyzing large sets of data (called big data) to discover patterns and other useful information.

DATA STORAGES

Distributed File Systems (2)

NoSQL Databases

Key-Value (3)

Document-Oriented (2)

Column-Family (2)

Graph-Oriented (2)

Analytic RDBMS

MPP Analytic RDBMS (4)

Traditional Analytic RDBMS (6)

INTEGRATION

Messaging

Data Collectors (3)

Distributed Message Brokers (4)

ETL/ELT (6)

PROCESSING & ANALYTICS

Visualization & Reporting

BI Platforms (10)

Interactive Dashboards (3)

Graphical Libraries (2)

Search & Query

Interactive Query Engines (3)

Distributed Query Engines (4)

Processing

Distributed Computing Engines (3)

Event Stream Processors (4)

Data Processing Frameworks (3)

DATA STORAGES

Distributed File Systems (2)

Commonly known as network file systems, these allow files to be accessed using the same interfaces and semantics as local files: for example, mounting/unmounting, listing directories, reading and writing at byte boundaries, and the system's native permission model.

Apache Hadoop
A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Links:
- https://hadoop.apache.org/
- https://en.wikipedia.org/wiki/Apache_Hadoop
Apache Cassandra FS
A file system (FS) layered on the Cassandra database, which is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Apache Cassandra is a free and open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.
Links:

NoSQL Databases

Provide a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

Key-Value (3)

A data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash.
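The associative-array idea can be illustrated with a minimal in-memory sketch. The class name `KeyValueStore` and its methods are hypothetical, chosen for illustration only; they are not the API of Riak, Redis, or Berkeley DB. Values are treated as opaque bytes, which is the defining trait of the paradigm.

```python
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only).

    The store never inspects a value: it is an opaque byte string.
    """

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # Insert or overwrite the value stored under the key.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Return the stored value, or None if the key is absent.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Remove the key; report whether it existed.
        return self._data.pop(key, None) is not None


store = KeyValueStore()
store.put("user:42", b'{"name": "Ada"}')
print(store.get("user:42"))
```

Real key-value stores add persistence, replication, and networking on top of this basic put/get/delete contract.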

Riak
A distributed NoSQL key-value data store that offers high availability, fault tolerance, operational simplicity, and scalability.
Riak (pronounced "ree-ack") is a distributed NoSQL key-value data store that offers high availability, fault tolerance, operational simplicity, and scalability. In addition to the open-source version, it comes in a supported enterprise version and a cloud storage version. Riak implements the principles from Amazon's Dynamo paper with heavy influence from the CAP Theorem. Written in Erlang, Riak has fault-tolerant data replication and automatic data distribution across the cluster for performance and resilience.
Links:
- http://basho.com/products/
- https://en.wikipedia.org/wiki/Riak
Redis
Networked, in-memory, and stores keys with optional durability.
Redis implements data structure servers. It is networked, in-memory, and stores keys with optional durability. Redis maps keys to types of values. Redis supports not only strings, but also abstract data types.
Links:
- https://redis.io/
- https://en.wikipedia.org/wiki/Redis
Oracle Berkeley DB
A software library intended to provide a high-performance embedded database for key/value data.
Berkeley DB (BDB) is a software library intended to provide a high-performance embedded database for key/value data. Berkeley DB is written in C with API bindings for C++, C#, Java, Perl, PHP, Python, Ruby, Smalltalk, Tcl, and many other programming languages. BDB stores arbitrary key/data pairs as byte arrays, and supports multiple data items for a single key.
Links:
- http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html
- https://en.wikipedia.org/wiki/Berkeley_DB

Document-Oriented (2)

Systems for storing, retrieving and managing document-oriented information, also known as semi-structured data.

A document-oriented database, or document store, is a computer program designed for storing, retrieving and managing document-oriented information, also known as semi-structured data. Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" has grown with the use of the term NoSQL itself. XML databases are a subclass of document-oriented databases that are optimized to work with XML documents. Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal.

Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept. The difference lies in the way the data is processed; in a key-value store, the data is considered to be inherently opaque to the database, whereas a document-oriented system relies on internal structure in the document in order to extract metadata that the database engine uses for further optimization. Although the difference is often moot due to tools in the systems, conceptually the document-store is designed to offer a richer experience with modern programming techniques.

Document databases contrast strongly with the traditional relational database (RDB). Relational databases generally store data in separate tables that are defined by the programmer, and a single object may be spread across several tables. Document databases store all information for a given object in a single instance in the database, and every stored object can be different from every other. This makes mapping objects into the database a simple task, normally eliminating anything similar to an object-relational mapping. This makes document stores attractive for programming web applications, which are subject to continual change in place, and where speed of deployment is an important issue.
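The contrast between an opaque key-value entry and a document whose internal structure the engine can use might be sketched as follows. The `find` helper and the sample records are hypothetical and only mimic, in plain Python, the kind of field-based lookup a document store provides; this is not MongoDB or CouchDB code.

```python
import json

# Key-value view: the value is an opaque string; the store cannot query inside it.
kv_store = {"user:1": json.dumps({"name": "Ada", "city": "London"})}

# Document view: each object is stored whole, and objects may differ in shape.
documents = [
    {"_id": 1, "name": "Ada", "city": "London"},
    {"_id": 2, "name": "Linus", "city": "Helsinki", "languages": ["C"]},
]


def find(docs, **criteria):
    """Return documents whose fields match all given criteria (hypothetical helper)."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


london = find(documents, city="London")
print(london)
```

Note that the second document carries a `languages` field the first one lacks, which is the "every stored object can be different" property described above.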

Links:
- https://en.wikipedia.org/wiki/Document-oriented_database

MongoDB
A free and open-source cross-platform document-oriented database program.
MongoDB (from humongous) is a free and open-source cross-platform document-oriented database program. MongoDB uses JSON-like documents with schemas.
Links:
- https://www.mongodb.com/
- https://en.wikipedia.org/wiki/MongoDB
Apache CouchDB
A document-oriented NoSQL database architecture, implemented in the concurrency-oriented language Erlang; uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.
Apache CouchDB is open source database software that focuses on ease of use and having an architecture that "completely embraces the Web". It has a document-oriented NoSQL database architecture and is implemented in the concurrency-oriented language Erlang; it uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.
Links:
- https://couchdb.apache.org/
- https://en.wikipedia.org/wiki/CouchDB

Column-Family (2)

Use a concept called a keyspace, containing all the column families, which contain rows, which contain columns.
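The keyspace, column family, row, column nesting can be pictured with nested dictionaries. This is only a conceptual sketch with made-up names (`users`, `events`, `get_column`); it says nothing about how HBase or Cassandra lay data out on disk.

```python
# keyspace -> column family -> row key -> column name -> value
keyspace = {
    "users": {  # column family
        "row-1": {"name": "Ada", "email": "ada@example.org"},
        "row-2": {"name": "Linus"},  # rows may hold different columns (sparse data)
    },
    "events": {  # another column family in the same keyspace
        "row-1": {"last_login": "2017-09-01"},
    },
}


def get_column(ks, family, row_key, column, default=None):
    """Look up one column value; return the default when any level is missing."""
    return ks.get(family, {}).get(row_key, {}).get(column, default)


print(get_column(keyspace, "users", "row-1", "email"))
print(get_column(keyspace, "users", "row-2", "email", default="<none>"))
```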

Apache HBase
An open source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
HBase is an open source, non-relational, distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop. It provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
Links:
- https://hbase.apache.org/
- https://en.wikipedia.org/wiki/Apache_HBase
Apache Cassandra
A free and open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Apache Cassandra is a free and open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.
Links:

Graph-Oriented (2)

Uses graph structures for semantic queries with nodes, edges and properties to represent and store data.
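A minimal sketch of nodes, edges, and properties, plus a two-hop traversal, is given below. The data and the helpers `neighbors` and `friends_of_friends` are invented for illustration and are not the query API of Neo4j or OrientDB.

```python
from collections import defaultdict

# Nodes carry properties; edges are (source, relationship, target) triples.
nodes = {
    "alice": {"label": "Person", "age": 34},
    "bob": {"label": "Person", "age": 29},
    "carol": {"label": "Person", "age": 41},
}
edges = [("alice", "KNOWS", "bob"), ("bob", "KNOWS", "carol")]

# Adjacency index so traversal follows direct connections between records.
adjacency = defaultdict(list)
for src, rel, dst in edges:
    adjacency[src].append((rel, dst))


def neighbors(node, relation):
    """Nodes reachable from `node` over one edge of the given relationship."""
    return [dst for rel, dst in adjacency[node] if rel == relation]


def friends_of_friends(node):
    """Two-hop traversal over KNOWS edges, excluding the start node."""
    result = set()
    for friend in neighbors(node, "KNOWS"):
        for fof in neighbors(friend, "KNOWS"):
            if fof != node:
                result.add(fof)
    return result


print(friends_of_friends("alice"))
```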

Neo4j
An ACID-compliant transactional database with native graph storage and processing.
Neo4j is a graph database management system developed by Neo Technology, Inc. It is described by its developers as an ACID-compliant transactional database with native graph storage and processing.
Links:
- https://neo4j.com/
- https://en.wikipedia.org/wiki/Neo4j
OrientDB
A multi-model database, supporting graph, document, key/value, and object models, but the relationships are managed as in graph databases.
OrientDB is an open source NoSQL database management system written in Java. It is a multi-model database, supporting graph, document, key/value, and object models, but the relationships are managed as in graph databases with direct connections between records. It supports schema-less, schema-full and schema-mixed modes.
Links:
- http://orientdb.com/orientdb/
- https://en.wikipedia.org/wiki/OrientDB

Analytic RDBMS

MPP Analytic RDBMS (4)

Use multi-core processors, multiple processors and servers, and storage appliances equipped for parallel processing, which enables reading many pieces of data across many processing units at the same time for enhanced speed.
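The partition, process in parallel, then combine pattern behind MPP can be sketched on a single machine. This is a loose analogy under stated assumptions: the helpers `partition`, `partial_sum`, and `parallel_sum` are hypothetical, and a thread pool stands in for the separate servers of a real MPP system (in CPython, threads illustrate the structure of the computation rather than a true speed-up).

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n):
    """Split rows into n slices, one per processing unit."""
    return [rows[i::n] for i in range(n)]


def partial_sum(chunk):
    """Work done independently by each unit on its own slice."""
    return sum(chunk)


def parallel_sum(rows, workers=4):
    """Scatter slices to workers, then combine the partial results."""
    chunks = partition(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))


total = parallel_sum(list(range(1, 101)))
print(total)  # 5050
```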

HP Vertica
An analytic database management software company.
Vertica Systems is an analytic database management software company. Vertica was founded in 2005 by database researcher Michael Stonebraker, and Andrew Palmer. Former CEOs include Ralph Breslauer and Christopher P. Lynch.

Vertica was acquired by Hewlett Packard on March 22, 2011. The acquisition expanded the HP Software portfolio for enterprise companies and the public sector group. On September 1, 2017, it was merged with Micro Focus.

Links:
- https://www.vertica.com/
- https://en.wikipedia.org/wiki/Vertica
Teradata
A fully scalable relational database management system produced by Teradata Corp., widely used to manage large data warehousing operations.
Teradata is a fully scalable relational database management system produced by Teradata Corp. It is widely used to manage large data warehousing operations.

The Teradata database system is based on off-the-shelf symmetric multiprocessing technology combined with communication networking, connecting symmetric multiprocessing systems to form large parallel processing systems.

Links:
- http://www.teradata.com/products-and-services/Teradata-Database/
- https://en.wikipedia.org/wiki/Teradata
Microsoft APS
A massively parallel processing (MPP) SQL Server appliance.
Formerly Parallel Data Warehouse (PDW). A massively parallel processing (MPP) SQL Server appliance optimized for large-scale data warehousing, for example hundreds of terabytes.
Links:
- https://www.microsoft.com/en-us/sql-server/analytics-platform-system
- https://en.wikipedia.org/wiki/Microsoft_Analysis_Services
[AWS] Redshift
A hosted data warehouse product, built on top of technology from the massively parallel processing (MPP) data-warehouse company ParAccel.
Amazon Redshift, a hosted data warehouse product, forms part of the larger cloud-computing platform Amazon Web Services. It is built on top of technology from the massively parallel processing (MPP) data-warehouse company ParAccel (later acquired by Actian). Redshift differs from Amazon's other hosted database offering, Amazon RDS, in its ability to handle analytics workloads on large-scale datasets stored according to a column-oriented DBMS principle. To handle large-scale datasets, Amazon makes use of massively parallel processing.
Links:
- https://aws.amazon.com/redshift/
- https://en.wikipedia.org/wiki/Amazon_Redshift

Traditional Analytic RDBMS (6)

Microsoft SQL Server
A relational database management system developed by Microsoft.
Microsoft SQL Server is a relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications—which may run either on the same computer or on another computer across a network (including the Internet).
Links:
Oracle RDBMS
An object-relational database management system produced and marketed by Oracle Corporation.
Oracle Database (commonly referred to as Oracle RDBMS or simply as Oracle) is an object-relational database management system produced and marketed by Oracle Corporation.
Links:
- https://www.oracle.com/database/index.html
- https://en.wikipedia.org/wiki/Oracle_Database
IBM DB2
Database server products developed by IBM.
IBM DB2 contains database server products developed by IBM. These products all support the relational model, but in recent years some products have been extended to support object-relational features and non-relational structures like JSON and XML.
Links:
- http://www-01.ibm.com/software/data/db2/
- https://en.wikipedia.org/wiki/IBM_Db2
MySQL
An open-source relational database management system (RDBMS).
MySQL is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter, and "SQL", the abbreviation for Structured Query Language. The MySQL development project has made its source code available under the terms of the GNU General Public License, as well as under a variety of proprietary agreements. MySQL was owned and sponsored by a single for-profit firm, the Swedish company MySQL AB, now owned by Oracle Corporation. For proprietary use, several paid editions are available, and offer additional functionality.

MySQL is a central component of the LAMP open-source web application software stack (and other "AMP" stacks). LAMP is an acronym for "Linux, Apache, MySQL, Perl/PHP/Python". Applications that use the MySQL database include: TYPO3, MODx, Joomla, WordPress, phpBB, MyBB, and Drupal. MySQL is also used in many high-profile, large-scale websites, including Google (though not for searches), Facebook, Twitter, Flickr, and YouTube.

Links:
- https://en.wikipedia.org/wiki/MySQL
PostgreSQL
An object-relational database management system (ORDBMS) with an emphasis on extensibility and standards compliance.
PostgreSQL, often simply Postgres, is an object-relational database management system (ORDBMS) with an emphasis on extensibility and standards compliance. As a database server, its primary functions are to store data securely and return that data in response to requests from other software applications. It can handle workloads ranging from small single-machine applications to large Internet-facing applications (or for data warehousing) with many concurrent users; on macOS Server, PostgreSQL is the default database; and it is also available for Microsoft Windows and Linux (supplied in most distributions).

PostgreSQL is ACID-compliant and transactional. It has updatable views and materialized views, triggers, and foreign keys; it supports functions, stored procedures, and other extensibility features.

PostgreSQL is developed by the PostgreSQL Global Development Group, a diverse group of many companies and individual contributors. It is free and open-source, released under the terms of the PostgreSQL License, a permissive software license.

Links:
- https://en.wikipedia.org/wiki/PostgreSQL
Adaptive Server Enterprise
A relational model database server product for businesses developed by Sybase Corporation.
SAP ASE (Adaptive Server Enterprise), originally known as Sybase SQL Server, and also commonly known as Sybase DB or ASE, is a relational model database server product for businesses developed by Sybase Corporation which became part of SAP AG. ASE is predominantly used on the Unix platform, but is also available for Microsoft Windows.
Links:
- https://en.wikipedia.org/wiki/Adaptive_Server_Enterprise

INTEGRATION

Messaging

Use messaging to transfer packets of data frequently, immediately, reliably, and asynchronously, using customizable formats.
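Asynchronous messaging can be sketched with an in-process queue: the sender enqueues messages and continues, while a receiver consumes them independently. The sketch uses only Python's standard `queue` and `threading` modules; the JSON message layout and the names `channel`, `received`, and `consumer` are assumptions made for illustration, not the protocol of any broker listed here.

```python
import json
import queue
import threading

channel = queue.Queue()  # stands in for a message channel
received = []            # messages seen by the consumer
_SENTINEL = None         # signals the consumer to stop


def consumer():
    """Receive messages until the sentinel arrives."""
    while True:
        message = channel.get()
        if message is _SENTINEL:
            break
        received.append(json.loads(message))


worker = threading.Thread(target=consumer)
worker.start()

# The producer sends small, self-describing packets without waiting for replies.
for i in range(3):
    channel.put(json.dumps({"id": i, "payload": f"event-{i}"}))
channel.put(_SENTINEL)
worker.join()

print(received)
```

A real message broker adds durability, delivery guarantees, and network transport around the same send/receive decoupling.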

Data Collectors (3)

Apache Flume
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its architecture is based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and failover and recovery mechanisms.
Links:
- https://flume.apache.org/
- https://en.wikipedia.org/wiki/Apache_Flume
Logstash
An open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite “stash.”
Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite “stash” (e.g., Elasticsearch).
Links:
- https://www.elastic.co/products/logstash
Fluentd
An open source data collector for a unified logging layer.
Fluentd is an open source data collector for a unified logging layer.
Links:
- http://www.fluentd.org/
- https://en.wikipedia.org/wiki/Fluentd

Distributed Message Brokers (4)

Apache Kafka
Used for building real-time data pipelines and streaming apps.
Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, fast, and runs in production in thousands of companies.
Links:
- https://kafka.apache.org/
- https://en.wikipedia.org/wiki/Apache_Kafka
RabbitMQ
Message broker software that implements the Advanced Message Queuing Protocol (AMQP).
RabbitMQ is message broker software (sometimes called message-oriented middleware) that implements the Advanced Message Queuing Protocol (AMQP).
Links:
- https://www.rabbitmq.com/
- https://en.wikipedia.org/wiki/RabbitMQ
[AWS] SQS
A fully-managed message queuing service.
Amazon Simple Queue Service (SQS) is a fully-managed message queuing service for reliably communicating among distributed software components and microservices - at any scale.
Links:
- https://aws.amazon.com/sqs/
- https://en.wikipedia.org/wiki/Amazon_Simple_Queue_Service
Apache ActiveMQ
A message broker with a full Java Message Service (JMS) client.
Apache ActiveMQ is a message broker with a full Java Message Service (JMS) client. It offers several modes for high availability, including file-system and database row-level locking mechanisms, sharing of the persistence store via a shared filesystem, or true replication using Apache ZooKeeper. It also provides a robust horizontal scaling mechanism called a Network of Brokers.
Links:
- http://activemq.apache.org/
- https://en.wikipedia.org/wiki/Apache_ActiveMQ

ETL/ELT (6)

Microsoft SSIS
A component of Microsoft SQL Server that serves as a platform for data integration and workflow applications.
SQL Server Integration Services (SSIS) is a component of the Microsoft SQL Server database software that can be used to perform a broad range of data migration tasks. SSIS is a platform for data integration and workflow applications. It features a data warehousing tool used for data extraction, transformation, and loading (ETL). The tool may also be used to automate maintenance of SQL Server databases and updates to multidimensional cube data.
Links:
- https://en.wikipedia.org/wiki/SQL_Server_Integration_Services
Informatica PowerCenter
A widely used extraction, transformation and loading (ETL) tool used in building enterprise data warehouses.
Informatica PowerCenter is a widely used extraction, transformation and loading (ETL) tool used in building enterprise data warehouses. The components within Informatica PowerCenter aid in extracting data from its source, transforming it as per business requirements and loading it into a target data warehouse.
Links:
- https://www.informatica.com/nl/products/data-integration/powercenter.html
Talend Integration Suite
An open and scalable data integration and data quality solution for integrating, cleansing and profiling all corporate data.
Talend Data Integration is an open and scalable data integration and data quality solution for integrating, cleansing and profiling all corporate data.
Links:
- https://www.talend.com/
- https://en.wikipedia.org/wiki/Talend
Oracle Data Integrator (ODI)
An extract, load and transform (ELT) tool (in contrast with the more common ETL approach) produced by Oracle that offers a graphical environment to build, manage and maintain data integration processes in business intelligence systems.
Oracle Data Integrator (ODI) is an extract, load and transform (ELT) tool (in contrast with the more common ETL approach) produced by Oracle that offers a graphical environment to build, manage and maintain data integration processes in business intelligence systems.
Links:
- https://en.wikipedia.org/wiki/Oracle_Data_Integrator
Oracle Warehouse Builder (OWB)
An ETL tool produced by Oracle that offers a graphical environment to build, manage and maintain data integration processes in business intelligence systems.
Oracle Warehouse Builder (OWB) is an ETL tool produced by Oracle that offers a graphical environment to build, manage and maintain data integration processes in business intelligence systems.
Links:
- https://en.wikipedia.org/wiki/Oracle_Warehouse_Builder
Pentaho Data Integration (PDI)
Open source products which provide data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load (ETL) capabilities.
Pentaho is a business intelligence (BI) software company that offers open source products which provide data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load (ETL) capabilities. It is headquartered in Orlando, Florida. Pentaho was acquired by Hitachi Data Systems in 2015. On September 19, 2017, Pentaho became part of Hitachi Vantara, a new company that unifies the operations of Pentaho, Hitachi Data Systems and Hitachi Insight Group.

Links:
- https://en.wikipedia.org/wiki/Pentaho

PROCESSING & ANALYTICS

Visualization & Reporting

BI Platforms (10)

Qlik
Facilitates creating visualizations, dashboards, and apps.
Qlik (original PC-based desktop tool was called QuikView) facilitates creating visualizations, dashboards, and apps. 'Quik' stood for 'Quality, Understanding, Interaction, Knowledge.'
Links:
- https://en.wikipedia.org/wiki/Qlik
Tableau
Produces interactive data visualization products focused on business intelligence.
Tableau Software (/tæbˈloʊ/ tab-loh) produces interactive data visualization products focused on business intelligence.
Links:
- https://en.wikipedia.org/wiki/Tableau_Software
MicroStrategy
Delivers reports and dashboards, and enables users to conduct ad hoc analysis and share insights via mobile devices or the Web.
MicroStrategy Analytics allows large organizations to analyze vast amounts of data and securely distribute actionable business insight throughout an enterprise, while also being able to cater to smaller workgroups and departmental use via MicroStrategy Desktop. MicroStrategy Analytics delivers reports and dashboards, and enables users to conduct ad hoc analysis and share insights via mobile devices (via MicroStrategy Mobile) or the Web (via MicroStrategy Web).
Links:
- https://en.wikipedia.org/wiki/MicroStrategy
JasperSoft
An open source Java reporting tool that can write to a variety of targets, such as the screen, a printer, or PDF, HTML, Microsoft Excel, RTF, ODT, comma-separated values or XML files.
JasperReports is an open source Java reporting tool that can write to a variety of targets, such as the screen, a printer, or PDF, HTML, Microsoft Excel, RTF, ODT, comma-separated values or XML files.
Links:
- https://www.jaspersoft.com/
- https://en.wikipedia.org/wiki/JasperReports
Microsoft Power BI
A business analytics service provided by Microsoft.
Power BI is a business analytics service provided by Microsoft. It provides interactive visualizations with self-service business intelligence capabilities, where end users can create reports and dashboards by themselves, without having to depend on information technology staff or database administrators.
Links:
- https://en.wikipedia.org/wiki/Power_BI
- https://powerbi.microsoft.com/en-us/
Oracle Business Intelligence (OBIEE)
A set of business intelligence tools consisting of former Siebel Systems business intelligence and Hyperion Solutions business intelligence offerings.
Oracle Business Intelligence Enterprise Edition Plus, also termed OBI EE Plus, is Oracle Corporation's set of business intelligence tools consisting of former Siebel Systems business intelligence and Hyperion Solutions business intelligence offerings.

The industry counterparts and main competitors of OBIEE are Microsoft BI, TIBCO Spotfire, IBM Cognos, SAP AG Business Objects and SAS Institute Inc. The products currently leverage a common BI Server providing integration among the tools.

Links:
- https://en.wikipedia.org/wiki/Oracle_Business_Intelligence_Suite_Enterprise_Edition
Salesforce Wave Analytics
Business intelligence, data visualization and data analytics.
Wave Analytics provides business intelligence, data visualization and data analytics.
Links:
TIBCO Spotfire
A smart, secure, governed, enterprise-class analytics platform with built-in data wrangling that delivers AI-driven, visual, geo, and streaming analytics.
TIBCO Spotfire is a smart, secure, governed, enterprise-class analytics platform with built-in data wrangling that delivers AI-driven, visual, geo, and streaming analytics.
Links:
- https://en.wikipedia.org/wiki/Spotfire
SAP Business Objects
SAP BusinessObjects (BO or BOBJ), SAP's business intelligence (BI) platform.
SAP BusinessObjects (BO or BOBJ) is an enterprise software company, specializing in business intelligence (BI). BusinessObjects was acquired in 2007 by German company SAP AG. The company claimed more than 46,000 customers in its final earnings release prior to being acquired by SAP. Its flagship product is BusinessObjects XI, with components that provide performance management, planning, reporting, query and analysis, and enterprise information management. BusinessObjects also offers consulting and education services to help customers deploy its business intelligence projects. Other toolsets enable universes (the BusinessObjects name for a semantic layer between the physical data store and the front-end reporting tool) and ready-written reports to be stored centrally and made selectively available to communities of users.

Links:
- https://en.wikipedia.org/wiki/BusinessObjects
IBM Cognos Analytics
An interactive way for virtually anyone to find, explore, and share data-driven insights in a governed environment.
"IBM Cognos Analytics, an interactive way for virtually anyone to find, explore, and share data-driven insights in a governed environment. Find precise and timely answers from your data or from content built by others. Create compelling reports and dashboards which you can easily distribute throughout your company. Use automated alerts to monitor changes to key findings. Confidently and quickly take actions to improve your business."
Links:
- https://www.ibm.com/products/cognos-analytics

Interactive Dashboards (3)

Splunk
Captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations.
Splunk (the product) captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations.
Links:
- https://en.wikipedia.org/wiki/Splunk
Kibana
An open source data visualization plugin for Elasticsearch.
Kibana is an open source data visualization plugin for Elasticsearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. Users can create bar, line and scatter plots, or pie charts and maps on top of large volumes of data.
Links:
- https://en.wikipedia.org/wiki/Kibana
Zoomdata
A data visualization and analytics tool that allows customers to explore and analyze the vast quantities of data in their datastores.
Zoomdata markets a data visualization and analytics tool that allows customers to explore and analyze the vast quantities of data in their datastores. The product is different from other tools in the industry due to a patent the company holds around 'Data Sharpening'. The approach involves returning the results of a query that is run instantly, while the image ‘sharpens’ and becomes clearer as more data is processed. The product includes a connector studio that directly connects to database, search, streaming, flat file and in-memory data sources. Matt Asay of readwrite.com compared it to watching a streaming movie, where you see some results immediately, soon followed by the whole.
Links:
- https://en.wikipedia.org/wiki/Zoomdata

Graphical Libraries (2)

D3.js
A JavaScript library for manipulating documents based on data.
D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.
Links:
- https://d3js.org/
GoJS
A feature-rich JavaScript library for implementing custom interactive diagrams and complex visualizations across modern web browsers and platforms.
GoJS is a feature-rich JavaScript library for implementing custom interactive diagrams and complex visualizations across modern web browsers and platforms. GoJS makes constructing JavaScript diagrams of complex nodes, links, and groups easy with customizable templates and layouts.
Links:
- https://gojs.net/latest/index.html

Search & Query

Interactive Query Engines (3)

Impala
Massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
Links:
- https://impala.incubator.apache.org/
- https://en.wikipedia.org/wiki/Cloudera_Impala
Apache Hive
A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Links:
- https://en.wikipedia.org/wiki/Apache_Hive
Spark SQL
Spark SQL is Apache Spark's module for working with structured data.
Spark SQL is Apache Spark's module for working with structured data. It lets you seamlessly mix SQL queries with Spark programs and query structured data inside them, using either SQL or a familiar DataFrame API. It is usable from Java, Scala, Python and R.
Links:
- http://spark.apache.org/sql/

Distributed Query Engines (4)

Splunk
Captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations.
Splunk captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations.
Links:
- https://en.wikipedia.org/wiki/Splunk
Elasticsearch
Elasticsearch is a distributed, JSON-based search and analytics engine designed for horizontal scalability, maximum reliability, and easy management.
Elasticsearch is a distributed, JSON-based search and analytics engine designed for horizontal scalability, maximum reliability, and easy management.
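
To make "JSON-based" concrete, here is a minimal sketch of how a search request body might be assembled before it is sent to a cluster. It only builds the JSON document and performs no network call. The helper name `build_search_body`, the field name `message`, and the search text are assumptions made for illustration; `match` and `size` are standard parts of the Elasticsearch query DSL.

```python
import json


def build_search_body(field, text, size=10):
    """Return a JSON string for a full-text `match` query on one field."""
    body = {
        "query": {"match": {field: text}},
        "size": size,  # maximum number of hits to return
    }
    return json.dumps(body, sort_keys=True)


# Hypothetical field name and search text, chosen only for illustration.
request_body = build_search_body("message", "disk failure", size=5)
```

A client would send a body like this to an index's search endpoint; the index name and the transport are left out on purpose.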
Links:
- https://www.elastic.co/products/elasticsearch
Apache Solr
Highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more.
Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.
Links:
- http://lucene.apache.org/solr/
Apache Lucene
A free and open-source information retrieval software library.
Apache Lucene is a free and open-source information retrieval software library, originally written completely in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache License.

Lucene has been ported to other programming languages including Object Pascal, Perl, C#, C++, Python, Ruby and PHP.
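
Information retrieval libraries of this kind are generally built around an inverted index, which maps each term to the documents that contain it. The toy below sketches that idea in plain Python. It is not Lucene's API, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every term in the query (AND semantics)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical sample documents keyed by document id.
docs = {
    1: "Hadoop stores large data sets",
    2: "Lucene indexes text for search",
    3: "Solr builds search on Lucene",
}
index = build_inverted_index(docs)
hits = search(index, "lucene search")
```

A production library adds tokenization, scoring and on-disk index segments on top of this basic structure.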

Links:
- https://en.wikipedia.org/wiki/Apache_Lucene

Processing

Distributed Computing Engines (3)

Hadoop MapReduce
A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
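
The split, map, sort and reduce flow described above can be sketched as a single-process word count. This is only an illustration of the dataflow and does not use the Hadoop API. The function names and the sample input are assumptions, and a real cluster would run the map tasks in parallel on separate nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(mapped_pairs):
    """Group values by key and sort the keys, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce task: combine all values emitted for one key."""
    return key, sum(values)


def run_job(chunks):
    # Each chunk is mapped independently of the others.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


# Hypothetical input, already split into three independent chunks.
word_counts = run_job(["big data big clusters", "data on clusters", "big jobs"])
```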

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Links:
- https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Apache Spark
An application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way.
Apache Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
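
The read-only nature of such a dataset can be illustrated with a toy class in plain Python. This is not the Spark API: `ToyDataset` and its methods are hypothetical, everything runs in one process, and the `lineage` tuple is only a stand-in for the kind of derivation record a fault-tolerant system could use to rebuild lost partitions.

```python
class ToyDataset:
    """A read-only, partitioned collection that remembers how it was derived."""

    def __init__(self, partitions, lineage=("source",)):
        # Tuples keep each partition immutable after construction.
        self._partitions = tuple(tuple(part) for part in partitions)
        self.lineage = lineage

    def map(self, fn):
        # Transformations never modify this dataset; they return a new one.
        new_parts = [[fn(x) for x in part] for part in self._partitions]
        return ToyDataset(new_parts, self.lineage + ("map",))

    def filter(self, predicate):
        new_parts = [[x for x in part if predicate(x)] for part in self._partitions]
        return ToyDataset(new_parts, self.lineage + ("filter",))

    def collect(self):
        # Gather every partition into one list, in partition order.
        return [x for part in self._partitions for x in part]


numbers = ToyDataset([[1, 2, 3], [4, 5, 6]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = even_squares.collect()
```

Because each transformation returns a new object, the original `numbers` dataset is unchanged after the chain runs.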
Links:
- http://spark.apache.org/
Apache Tez
Aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
The Apache Tez™ project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
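
As a rough illustration of what running a directed acyclic graph of tasks involves, the sketch below orders a small task graph into waves of tasks whose dependencies are already satisfied. It uses Python's standard `graphlib` module (Python 3.9 or later), not the Tez API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key maps to the set of tasks it depends on.
dependencies = {
    "load_orders": set(),
    "load_customers": set(),
    "join": {"load_orders", "load_customers"},
    "aggregate": {"join"},
    "write_report": {"aggregate"},
}


def run_dag(deps):
    """Return tasks grouped into waves; tasks in one wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        waves.append(ready)
        sorter.done(*ready)  # marking tasks done unlocks their dependents
    return waves


waves = run_dag(dependencies)
```

Tasks within the same wave could be dispatched to different workers at the same time, which is the main advantage a task graph has over a fixed linear pipeline.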
Links:
- https://tez.apache.org/

Event Stream Processors (4)

A set of technologies designed to assist the construction of event-driven information systems, including visualization, databases, middleware, and processing languages.

Apache Storm
A distributed stream processing computation framework written in Clojure.
Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter.
Links:
- http://storm.apache.org/
Spark Streaming
Facilitates building scalable fault-tolerant streaming applications.
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
Links:
- http://spark.apache.org/streaming/
[AWS] Kinesis
Continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events.
Amazon Kinesis Streams enables you to build custom applications that process or analyze streaming data for specialized needs. Amazon Kinesis Streams can continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events.
Links:
- https://aws.amazon.com/kinesis/streams/
Samza
An open-source near-realtime, asynchronous computational framework for stream processing developed in Scala and Java.
Apache Samza is an open-source near-realtime, asynchronous computational framework for stream processing developed by the Apache Software Foundation in Scala and Java.
Links:
- https://en.wikipedia.org/wiki/Apache_Samza

Data Processing Frameworks (3)

Cascading
The proven application development platform for building data applications on Hadoop.
Cascading is the proven application development platform for building data applications on Hadoop.
Links:
- http://www.cascading.org/
Apache Hive
A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Links:
- https://en.wikipedia.org/wiki/Apache_Hive
Apache Pig
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
Links:
- https://pig.apache.org/

	PostgreSQL
	MySQL
	Microsoft SQL Server
	Adaptive Server Enterprise
	IBM DB2
	Oracle RDBMS

	Oracle Data Integrator (ODI)
	Microsoft SSIS
	Talend Integration Suite
	Pentaho Data Integration (PDI)
	Oracle Warehouse Builder (OWB)
	Informatica PowerCenter

	IBM Cognos Analytics
	Salesforce Wave Analytics
	Microsoft Power BI
	TIBCO Spotfire
	SAP Business Objects
	Tableau
	JasperSoft
	Oracle Business Intelligence (OBIEE)
	Qlik
	MicroStrategy

	Apache Hadoop
	Apache Cassandra FS

	Riak
	Redis
	Oracle Berkeley DB

	MongoDB
	Apache CouchDB

	Apache HBase
	Apache Cassandra

Big Data Analytics

Classical ETL BI Architecture

Modern Big Data Analytics Architecture

Lambda Architecture

Message Integration Patterns

Hadoop MapReduce Illustration

Hadoop Ecosystem

Hadoop Illustration (kitchen)

Example Big Data Architecture

Kafka Architecture

Kafka Primer

Big Data Messaging with Kafka

Samza vs. Hadoop