Big Data Analytics

Big data analytics is the process of collecting, organizing and analyzing large sets of data (called big data) to discover patterns and other useful information.

DATA STORAGE
Distributed File Systems (2)
Commonly known as network file systems, these allow files to be accessed using the same interfaces and semantics as local files, for example mounting/unmounting, listing directories, reading and writing at byte boundaries, and the system's native permission model.
  • Apache Hadoop
    A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
  • Apache Cassandra FS
    A file system built on the Cassandra database, designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
NoSQL Databases
Provide a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases.
Key-Value (3)
A data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash.
  • Riak
    A distributed NoSQL key-value data store that offers high availability, fault tolerance, operational simplicity, and scalability.
  • Redis
    A networked, in-memory key-value store with optional durability.
  • Oracle Berkeley DB
    A software library intended to provide a high-performance embedded database for key/value data.
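The associative-array model behind these stores can be sketched in a few lines of Python. This is a minimal illustration only: the `KeyValueStore` class and its `put`/`get`/`delete` methods are hypothetical names, not the API of Riak, Redis, or Berkeley DB.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical; not any product's API)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # The value is opaque to the store: it is never inspected or indexed.
        self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        return self._data.get(key, default)

    def delete(self, key: str) -> bool:
        # Returns True if the key existed and was removed.
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("user:42", {"name": "Ada"})
```

Every operation addresses exactly one key, which is what makes this model straightforward to partition across servers.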
Document-Oriented (2)
Systems for storing, retrieving and managing document-oriented information, also known as semi-structured data.
  • MongoDB
    A free and open-source cross-platform document-oriented database program.
  • Apache CouchDB
    A document-oriented NoSQL database architecture, implemented in the concurrency-oriented language Erlang; uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.
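As a rough sketch of what "semi-structured" means in practice, the Python below keeps JSON documents with differing fields in one collection and filters them by field. The sample documents and the `find` helper are made up for illustration and do not reflect the MongoDB or CouchDB query interfaces.

```python
import json
from typing import Any, Dict, List

# Documents in the same collection do not have to share a schema.
raw_documents = [
    '{"_id": 1, "name": "Ada", "languages": ["Python", "Erlang"]}',
    '{"_id": 2, "name": "Linus", "address": {"city": "Portland"}}',
    '{"_id": 3, "name": "Grace", "languages": ["COBOL"]}',
]
collection: List[Dict[str, Any]] = [json.loads(d) for d in raw_documents]


def find(docs: List[Dict[str, Any]], field: str, value: Any) -> List[Dict[str, Any]]:
    """Return documents whose field equals value or, for list fields, contains it."""
    matches = []
    for doc in docs:
        if field not in doc:
            continue  # a missing field is simply no match, not an error
        stored = doc[field]
        if stored == value or (isinstance(stored, list) and value in stored):
            matches.append(doc)
    return matches


erlang_users = find(collection, "languages", "Erlang")
```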
Column-Family (2)
Use a concept called a keyspace, containing all the column families, which contain rows, which contain columns.
  • Apache HBase
    An open source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
  • Apache Cassandra
    A free and open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
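The keyspace, column family, row, column nesting described above can be pictured as nested dictionaries. The sketch below is only a mental model under that assumption; `make_keyspace` and `insert` are hypothetical helpers, and real systems add partitioning, timestamps, and on-disk storage.

```python
from collections import defaultdict
from typing import Any, Dict


def make_keyspace() -> defaultdict:
    # keyspace -> column family -> row key -> {column name: value}
    return defaultdict(lambda: defaultdict(dict))


def insert(ks: defaultdict, column_family: str, row_key: str, columns: Dict[str, Any]) -> None:
    # Rows in one column family may hold different sets of columns.
    ks[column_family][row_key].update(columns)


keyspace = make_keyspace()
insert(keyspace, "users", "u1", {"name": "Ada", "email": "ada@example.com"})
insert(keyspace, "users", "u2", {"name": "Linus"})
insert(keyspace, "users", "u1", {"city": "London"})  # adds a column to an existing row
```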
Graph-Oriented (2)
Use graph structures for semantic queries with nodes, edges and properties to represent and store data.
  • Neo4j
    An ACID-compliant transactional database with native graph storage and processing.
  • OrientDB
    A multi-model database, supporting graph, document, key/value, and object models, but the relationships are managed as in graph databases.
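To make the node, edge, and property vocabulary concrete, here is a small Python sketch of a property graph with a breadth-first traversal. The data and the `neighbors` and `reachable` helpers are invented for illustration; graph databases expose this through their own query languages rather than code like this.

```python
from collections import deque
from typing import List

# Nodes and edges both carry properties.
nodes = {
    "alice": {"label": "Person", "age": 34},
    "bob": {"label": "Person", "age": 29},
    "carol": {"label": "Person", "age": 41},
    "acme": {"label": "Company"},
}
edges = [
    ("alice", "KNOWS", "bob", {"since": 2015}),
    ("bob", "KNOWS", "carol", {"since": 2019}),
    ("carol", "WORKS_AT", "acme", {"role": "engineer"}),
]


def neighbors(node: str, edge_type: str) -> List[str]:
    """Targets of outgoing edges of the given type."""
    return [dst for src, etype, dst, _props in edges if src == node and etype == edge_type]


def reachable(start: str, edge_type: str) -> List[str]:
    """Breadth-first search along one edge type; returns nodes in visit order."""
    seen = {start}
    order: List[str] = []
    pending = deque([start])
    while pending:
        current = pending.popleft()
        for nxt in neighbors(current, edge_type):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                pending.append(nxt)
    return order


known_by_alice = reachable("alice", "KNOWS")
```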
Analytic RDBMS
MPP Analytic RDBMS (4)
Use multi-core processors, multiple processors and servers, and storage appliances equipped for parallel processing, which enables reading many pieces of data across many processing units at the same time for enhanced speed.
  • HP Vertica
    A column-oriented analytic database management system.
  • Teradata
    A scalable relational database management system produced by Teradata Corporation, widely used to manage large data warehousing operations.
  • Microsoft APS
    A massively parallel processing (MPP) SQL Server appliance.
  • [AWS] Redshift
    A hosted data warehouse product, built on top of technology from the massively parallel processing (MPP) data-warehouse company ParAccel.
Traditional Analytic RDBMS (6)
  • Microsoft SQL Server
    A relational database management system developed by Microsoft.
  • Oracle RDBMS
    An object-relational database management system produced and marketed by Oracle Corporation.
  • IBM DB2
    Database server products developed by IBM.
  • MySQL
    An open-source relational database management system (RDBMS).
  • PostgreSQL
    An object-relational database management system (ORDBMS) with an emphasis on extensibility and standards compliance.
  • Adaptive Server Enterprise
    A relational model database server product for businesses developed by Sybase Corporation.
INTEGRATION
Messaging
Use messaging to transfer packets of data frequently, immediately, reliably, and asynchronously, using customizable formats.
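A minimal in-process sketch of asynchronous messaging, using only the Python standard library: the sender enqueues messages and moves on, while a separate thread consumes them. The queue stands in for a messaging channel, and the `producer` and `consumer` functions and the sample events are hypothetical; a real broker would add persistence and network transport.

```python
import queue
import threading
from typing import Any, Dict, List


def producer(channel: queue.Queue, messages: List[Dict[str, Any]]) -> None:
    for message in messages:
        channel.put(message)  # returns without waiting for the consumer
    channel.put(None)  # sentinel: no more messages


def consumer(channel: queue.Queue, received: List[Dict[str, Any]]) -> None:
    while True:
        message = channel.get()  # blocks until a message is available
        if message is None:
            break
        received.append(message)


channel: queue.Queue = queue.Queue()
received: List[Dict[str, Any]] = []
worker = threading.Thread(target=consumer, args=(channel, received))
worker.start()
producer(channel, [{"event": "login", "user": "ada"}, {"event": "logout", "user": "ada"}])
worker.join()
```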
Data Collectors (3)
  • Apache Flume
    A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
  • Logstash
    An open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to a configured destination, or "stash".
  • Fluentd
    An open source data collector for a unified logging layer.
Distributed Message Brokers (4)
ETL/ELT (6)
  • Microsoft SSIS
    A component of Microsoft SQL Server that serves as a platform for data integration and workflow applications.
  • Informatica PowerCenter
    A widely used extraction, transformation and loading (ETL) tool used in building enterprise data warehouses.
  • Talend Integration Suite
    An open and scalable data integration and data quality solution for integrating, cleansing and profiling all corporate data.
  • Oracle Data Integrator (ODI)
    An extract, load and transform (ELT) tool (in contrast with the more common ETL approach) produced by Oracle that offers a graphical environment to build, manage and maintain data integration processes in business intelligence systems.
  • Oracle Warehouse Builder (OWB)
    An ETL tool produced by Oracle that offers a graphical environment to build, manage and maintain data integration processes in business intelligence systems.
  • Pentaho Data Integration (PDI)
    The data integration component of the open source Pentaho suite, which provides data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load (ETL) capabilities.
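The extract, transform, load sequence these tools automate can be sketched with the Python standard library. The CSV sample, the `orders` table, and the cleaning rules are assumptions made up for this example; in an ELT flow the raw rows would be loaded first and the cleaning would run inside the target database instead.

```python
import csv
import io
import sqlite3
from typing import Dict, List, Tuple

RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,alice,
"""


def extract(text: str) -> List[Dict[str, str]]:
    # Extract: read raw records from the source as-is.
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: List[Dict[str, str]]) -> List[Tuple[int, str, float]]:
    # Transform: normalize names, convert types, drop rows with no amount.
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append((int(row["order_id"]), row["customer"].strip().lower(), float(amount)))
    return cleaned


def load(conn: sqlite3.Connection, rows: List[Tuple[int, str, float]]) -> None:
    # Load: write the cleaned rows into the target table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```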
PROCESSING & ANALYTICS
Visualization & Reporting
BI Platforms (10)
  • Qlik
    Facilitates creating visualizations, dashboards, and apps.
  • Tableau
    Produces interactive data visualization products focused on business intelligence.
  • MicroStrategy
    Delivers reports and dashboards, and enables users to conduct ad hoc analysis and share insights via mobile devices or the Web.
  • JasperSoft
    An open source Java reporting tool that can write to a variety of targets, such as the screen, a printer, or PDF, HTML, Microsoft Excel, RTF, ODT, comma-separated values (CSV) or XML files.
  • Microsoft Power BI
    A business analytics service provided by Microsoft.
  • Oracle Business Intelligence (OBIEE)
    A set of business intelligence tools consisting of former Siebel Systems business intelligence and Hyperion Solutions business intelligence offerings.
  • Salesforce Wave Analytics
    A platform for business intelligence, data visualization and data analytics.
  • TIBCO Spotfire
    An enterprise analytics platform with built-in data wrangling that delivers AI-driven, visual, geographic, and streaming analytics.
  • SAP BusinessObjects
    SAP's business intelligence (BI) platform, also abbreviated BO or BOBJ.
  • IBM Cognos Analytics
    A platform for finding, exploring, and sharing data-driven insights in a governed environment.
Interactive Dashboards (3)
  • Splunk
    Captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations.
  • Kibana
    An open source data visualization plugin for Elasticsearch.
  • Zoomdata
    A data visualization and analytics tool that allows customers to explore and analyze the vast quantities of data in their datastores.
Graphical Libraries (2)
  • D3.js
    A JavaScript library for manipulating documents based on data.
  • GoJS
    A feature-rich JavaScript library for implementing custom interactive diagrams and complex visualizations across modern web browsers and platforms. 
Search & Query
Interactive Query Engines (3)
  • Impala
    A massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
  • Apache Hive
    A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. 
  • Spark SQL
    Apache Spark's module for working with structured data.
Distributed Query Engines (4)
  • Splunk
    Captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations.
  • Elasticsearch
    A distributed, JSON-based search and analytics engine designed for horizontal scalability, reliability, and easy management.
  • Apache Solr
    A highly reliable, scalable and fault-tolerant search platform providing distributed indexing, replication and load-balanced querying, automated failover and recovery, and centralized configuration.
  • Apache Lucene
    A free and open-source information retrieval software library.
Processing
Distributed Computing Engines (3)
  • Hadoop MapReduce
    A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
  • Apache Spark
    A cluster computing engine whose application programming interface is centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way.
  • Apache Tez
    An application framework that allows a complex directed acyclic graph of tasks for processing data.
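The programming model behind the Hadoop MapReduce entry above can be shown with a single-process word count in plain Python. This is a sketch of the map, shuffle, and reduce steps only; the function names are illustrative, and it does not use the Hadoop API or run in parallel.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    # Map: emit a (word, 1) pair for every word in one input record.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group all emitted values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # Reduce: collapse the values for one key into a single result.
    return key, sum(values)


lines = ["big data big clusters", "data pipelines"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

On a cluster, the map calls run on the nodes holding each input split and the shuffle moves data between nodes, but the three roles stay the same.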
Event Stream Processors (4)
A set of technologies designed to assist the construction of event-driven information systems, including visualization, databases, middleware, and processing languages.
  • Apache Storm
    A distributed stream processing computation framework written in Clojure.
  • Spark Streaming
    Facilitates building scalable fault-tolerant streaming applications.
  • [AWS] Kinesis
    Continuously captures and stores terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events.
  • Apache Samza
    An open-source near-realtime, asynchronous computational framework for stream processing developed in Scala and Java.
Data Processing Frameworks (3)
  • Cascading
    An application development platform for building data applications on Hadoop.
  • Apache Hive
    A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
  • Apache Pig
    A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.