Big Data Resources

Big Data News http://www.bigdatanews.com

Big Data CloudU  http://cloudu.rackspace.com

Coursera https://www.coursera.org

 

Awesome Big Data

A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data.

Your contributions are always welcome!

Frameworks

  • Apache Hadoop – framework for distributed processing. Integrated MapReduce, YARN and HDFS.

Distributed Programming

  • AddThis Hydra – distributed data processing and storage system.
  • AMPLab SIMR – run Spark on Hadoop MapReduce v1.
  • Apache Crunch – a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
  • Apache DataFu – collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
  • Apache Gora – framework for in-memory data model and persistence.
  • Apache Hama – BSP (Bulk Synchronous Parallel) computing framework.
  • Apache MapReduce – programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
  • Apache Pig – high level language to express data analysis programs for Hadoop.
  • Apache S4 – framework for stream processing, implementation of S4.
  • Apache Spark – framework for in-memory cluster computing.
  • Apache Spark Streaming – framework for stream processing, part of Spark.
  • Apache Storm – framework for stream processing by Twitter, also available on YARN.
  • Apache Tez – application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
  • Apache Twill – abstraction over YARN that reduces the complexity of developing distributed applications.
  • Cascalog – data processing and querying library.
  • Cheetah – High Performance, Custom Data Warehouse on Top of MapReduce.
  • Concurrent Cascading – framework for data management/analytics on Hadoop.
  • Damballa Parkour – MapReduce library for Clojure.
  • Datasalt Pangool – alternative MapReduce paradigm.
  • DataTorrent StrAM – real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
  • Facebook Corona – Hadoop enhancement which removes single point of failure.
  • Facebook Peregrine – Map Reduce framework.
  • Facebook Scuba – distributed in-memory datastore.
  • Google MapReduce – map reduce framework.
  • Google MillWheel – fault tolerant stream processing framework.
  • HadoopDB – hybrid of MapReduce and DBMS.
  • JAQL – declarative programming language for working with structured, semi-structured and unstructured data.
  • Kite – set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
  • Metamarkets Druid – framework for real-time analysis of large datasets.
  • Netflix PigPen – map-reduce for Clojure which compiles to Apache Pig.
  • Nokia Disco – MapReduce framework developed by Nokia.
  • Pydoop – Python MapReduce and HDFS API for Hadoop.
  • Stratosphere – general purpose cluster computing framework.
  • Twitter Scalding – Scala library for Map Reduce jobs, built on Cascading.
  • Twitter Summingbird – Streaming MapReduce with Scalding and Storm, by Twitter.
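
Many of the entries above implement or build on the MapReduce programming model. As a rough illustration of that model only, here is a minimal single-process sketch in Python of the classic word count. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen for this example; they are not the API of Hadoop or of any library listed here, and a real framework would run the phases in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially over an iterable of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling word_count on a list of strings returns a dictionary mapping each word to its count. Because each map call and each reduce call depends only on its own input, a distributed framework is free to schedule them on different machines.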

Distributed Filesystem

Column Data Model

  • Actian Vector – column-oriented analytic database.
  • Apache Accumulo – distributed key/value store, built on Hadoop.
  • Apache Cassandra – column-oriented distributed datastore, inspired by BigTable.
  • Apache HBase – column-oriented distributed datastore, inspired by BigTable.
  • C-Store – column oriented DBMS.
  • Facebook HydraBase – evolution of HBase made by Facebook.
  • Google BigTable – column-oriented distributed datastore.
  • Google Cloud Datastore – fully managed, schemaless database for storing non-relational data over BigTable.
  • Hypertable – column-oriented distributed datastore, inspired by BigTable.
  • InfiniDB – accessed through a MySQL interface and uses massively parallel processing to parallelize queries.
  • MonetDB – column store database.
  • OhmData C5 – improved version of HBase.
  • Parquet – columnar storage format for Hadoop.
  • Twitter Manhattan – real-time, multi-tenant distributed database for Twitter scale.
  • Vertica – designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.

Document Data Model

  • Crate Data – open source, massively scalable data store that requires zero administration.
  • Facebook Apollo – Facebook’s Paxos-like NoSQL database.
  • jumboDB – document oriented datastore over Hadoop.
  • LinkedIn Espresso – horizontally scalable document-oriented NoSQL data store.
  • MarkLogic – Schema-agnostic Enterprise NoSQL database technology.
  • MongoDB – Document-oriented database system.
  • RethinkDB – document database that supports queries like table joins and group by.

Key-value Data Model

  • Amazon DynamoDB – distributed key/value store, implementation of Dynamo paper.
  • Edis – protocol-compatible server replacement for Redis.
  • ElephantDB – Distributed database specialized in exporting data from Hadoop.
  • EventStore – distributed time series database.
  • LinkedIn Krati – simple persistent data store with very low latency and high throughput.
  • LinkedIn Voldemort – distributed key/value storage system.
  • OpenTSDB – distributed time series database on top of HBase.
  • Redis – in memory key value datastore.
  • Riak – a decentralized datastore.
  • Storehaus – library to work with asynchronous key value stores, by Twitter.
  • Tarantool – an efficient NoSQL database and a Lua application server.

Graph Data Model

  • Apache Giraph – implementation of Pregel, based on Hadoop.
  • Apache Spark Bagel – implementation of Pregel, part of Spark.
  • ArangoDB – multi-model distributed database.
  • Facebook TAO – distributed data store widely used at Facebook to store and serve the social graph.
  • Gremlin – graph traversal language.
  • Google Cayley – open-source graph database.
  • Google Pregel – graph processing framework.
  • GraphLab PowerGraph – a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
  • GraphX – resilient Distributed Graph System on Spark.
  • Intel GraphBuilder – tools to construct large-scale graphs on top of Hadoop.
  • Neo4j – graph database written entirely in Java.
  • OrientDB – document and graph database.
  • Phoebus – framework for large scale graph processing.
  • Titan – distributed graph database, built over Cassandra.
  • Twitter FlockDB – distributed graph database.

NewSQL Databases

  • Amazon RedShift – data warehouse service, based on PostgreSQL.
  • BayesDB – statistic oriented SQL database.
  • FoundationDB – distributed database, inspired by F1.
  • Google F1 – distributed SQL database built on Spanner.
  • Google Spanner – globally distributed semi-relational database.
  • H-Store – experimental main-memory, parallel database management system optimized for on-line transaction processing (OLTP) applications.
  • Haeinsa – linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
  • HandlerSocket – NoSQL plugin for MySQL/MariaDB.
  • InfiniSQL – infinitely scalable RDBMS.
  • MemSQL – in-memory SQL database with optimized columnar storage on flash.
  • NuoDB – SQL/ACID compliant distributed database.
  • Postgres-XL – Scalable Open Source PostgreSQL-based Database Cluster.
  • SAP HANA – SQL based in-memory database.
  • SenseiDB – distributed, realtime, semi-structured database.
  • Sky – database used for flexible, high performance analysis of behavioral data.
  • SymmetricDS – open source software for both file and database synchronization.

Time-Series Databases

  • TempoDB – cloud-based time series database.
  • InfluxDB – open-source distributed time series database.
  • OpenTSDB – uses HBase.
  • KairosDB – similar to OpenTSDB but allows for Cassandra.
  • Cube – uses MongoDB to store time series data.

SQL-like processing

Data Ingestion

Integrated Development Environments

Service Programming

  • Akka Toolkit – runtime for distributed and fault-tolerant event-driven applications on the JVM.
  • Apache Avro – data serialization system.
  • Apache Curator – Java libraries for Apache ZooKeeper.
  • Apache Karaf – OSGi runtime that runs on top of any OSGi framework.
  • Apache Thrift – framework to build binary protocols.
  • Apache ZooKeeper – centralized service for process management.
  • Google Chubby – a lock service for loosely-coupled distributed systems.
  • LinkedIn Norbert – cluster manager.
  • OpenMPI – message passing framework.
  • Serf – decentralized solution for service discovery and orchestration.
  • Spring XD – distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
  • Twitter Elephant Bird – libraries for working with LZOP-compressed data.
  • Twitter Finagle – asynchronous network stack for the JVM.

Scheduling

Machine Learning

  • Apache Mahout – machine learning library for Hadoop.
  • brain – Neural networks in JavaScript.
  • Cloudera Oryx – real-time large-scale machine learning.
  • Concurrent Pattern – machine learning library for Cascading.
  • convnetjs – Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
  • Decider – Flexible and Extensible Machine Learning in Ruby.
  • etcML – text classification with machine learning.
  • Etsy Conjecture – scalable Machine Learning in Scalding.
  • H2O – statistical, machine learning and math runtime for Hadoop.
  • MLbase – distributed machine learning libraries for the BDAS stack.
  • MLPNeuralNet – Fast multilayer perceptron neural network library for iOS and Mac OS X.
  • nupic – Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
  • PredictionIO – machine learning server built on Hadoop, Mahout and Cascading.
  • scikit-learn – machine learning in Python.
  • Spark MLlib – a Spark implementation of some common machine learning (ML) functionality.
  • Vowpal Wabbit – learning system sponsored by Microsoft and Yahoo!.
  • WEKA – suite of machine learning software.

Benchmarking

Security

System Deployment

  • Apache Ambari – operational framework for Hadoop management.
  • Apache Bigtop – system deployment framework for the Hadoop ecosystem.
  • Apache Helix – cluster management framework.
  • Apache Mesos – cluster manager.
  • Apache Slider – YARN application to deploy existing distributed applications on YARN.
  • Apache Whirr – set of libraries for running cloud services.
  • Apache YARN – Cluster manager.
  • Brooklyn – library that simplifies application deployment and management.
  • Buildoop – similar to Apache Bigtop, based on the Groovy language.
  • Cloudera HUE – web application for interacting with Hadoop.
  • Facebook Prism – multi datacenters replication system.
  • Google Borg – job scheduling and monitoring system.
  • Google Omega – job scheduling and monitoring system.
  • Hortonworks HOYA – application that can deploy HBase cluster on YARN.
  • Marathon – Mesos framework for long-running services.

Applications

  • Apache Kiji – framework to collect and analyze data in real-time, based on HBase.
  • Apache Nutch – open source web crawler.
  • Apache OODT – capturing, processing and sharing of data for NASA’s scientific archives.
  • Apache Tika – content analysis toolkit.
  • Eclipse BIRT – Eclipse-based reporting system.
  • Eventhub – open source event analytics platform.
  • HIPI Library – API for performing image processing tasks on Hadoop’s MapReduce.
  • Hunk – Splunk analytics for Hadoop.
  • MADlib – data-processing library of an RDBMS to analyze data.
  • PivotalR – R on Pivotal HD / HAWQ and PostgreSQL.
  • Qubole – auto-scaling Hadoop cluster, built-in data connectors.
  • Snowplow – enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
  • SparkR – R frontend for Spark.
  • Splunk – analyzer for machine-generated data.
  • Talend – unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.

Search engine and framework

MySQL forks and evolutions

  • Amazon RDS – MySQL databases in Amazon’s cloud.
  • Drizzle – evolution of MySQL 6.0.
  • Google Cloud SQL – MySQL databases in Google’s cloud.
  • MariaDB – enhanced, drop-in replacement for MySQL.
  • MySQL Cluster – MySQL implementation using NDB Cluster storage engine.
  • Percona Server – enhanced, drop-in replacement for MySQL.
  • ProxySQL – High Performance Proxy for MySQL.
  • TokuDB – storage engine for MySQL and MariaDB.
  • WebScaleSQL – collaboration among engineers from several companies that face similar challenges in running MySQL at scale.

Memcached forks and evolutions

Embedded Databases

  • BerkeleyDB – a software library that provides a high-performance embedded database for key/value data.
  • HanoiDB – Erlang LSM BTree Storage.
  • LevelDB – a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
  • LMDB – ultra-fast, ultra-compact key-value embedded data store developed by Symas.
  • RocksDB – embeddable persistent key-value store for fast storage based on LevelDB.

Business Intelligence

  • Jaspersoft – powerful business intelligence suite.
  • Jedox Palo – customisable business intelligence platform.
  • Microsoft – business intelligence software and platform.
  • Microstrategy – software platforms for business intelligence, mobile intelligence, and network applications.
  • Pentaho – business intelligence platform.
  • Qlik – business intelligence and analytics platform.
  • Tableau – business intelligence platform.
  • Spango BI – open source business intelligence platform.

Data Visualization

  • Arbor – graph visualization library using web workers and jQuery.
  • Chart.js – open source HTML5 Charts visualizations.
  • Cubism – JavaScript library for time series visualization.
  • D3 – JavaScript library for manipulating documents.
  • Envisionjs – dynamic HTML5 visualization.
  • Grafana – graphite dashboard frontend, editor and graph composer.
  • Graphite – scalable Realtime Graphing.
  • Google Charts – simple charting API.
  • Highcharts – simple and flexible charting API.
  • Matplotlib – plotting with Python.
  • NVD3 – chart components for d3.js.
  • Peity – Progressive bar, line and pie charts.
  • Recline – simple but powerful library for building data applications in pure JavaScript and HTML.
  • Sigma.js – JavaScript library dedicated to graph drawing.
  • Vega – a visualization grammar.

Interesting Readings

  • Big Data Benchmark – Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
  • NoSQL Comparison – Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.

Interesting Papers

2013 – 2014

  • 2013 – AMPLab – Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
  • 2013 – AMPLab – MLbase: A Distributed Machine-learning System.
  • 2013 – AMPLab – Shark: SQL and Rich Analytics at Scale.
  • 2013 – AMPLab – GraphX: A Resilient Distributed Graph System on Spark.
  • 2013 – Google – HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
  • 2013 – Microsoft – Scalable Progressive Analytics on Big Data in the Cloud.
  • 2013 – Metamarkets – Druid: A Real-time Analytical Data Store.
  • 2013 – Google – Online, Asynchronous Schema Change in F1.
  • 2013 – Google – F1: A Distributed SQL Database That Scales.
  • 2013 – Google – MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
  • 2013 – Facebook – Scuba: Diving into Data at Facebook.
  • 2013 – Facebook – Unicorn: A System for Searching the Social Graph.
  • 2013 – Facebook – Scaling Memcache at Facebook.

2011 – 2012

  • 2012 – AMPLab – Blink and It’s Done: Interactive Queries on Very Large Data.
  • 2012 – AMPLab – Fast and Interactive Analytics over Hadoop Data with Spark.
  • 2012 – AMPLab – Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
  • 2012 – Microsoft – Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
  • 2012 – Microsoft – Paxos Made Parallel.
  • 2012 – AMPLab – BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
  • 2012 – Google – Processing a trillion cells per mouse click.
  • 2012 – Google – Spanner: Google’s Globally-Distributed Database.
  • 2011 – AMPLab – Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
  • 2011 – AMPLab – Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
  • 2011 – Google – Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

2001 – 2010

  • 2010 – Facebook – Finding a needle in Haystack: Facebook’s photo storage.
  • 2010 – AMPLab – Spark: Cluster Computing with Working Sets.
  • 2010 – Google – Storage Architecture and Challenges.
  • 2010 – Google – Pregel: A System for Large-Scale Graph Processing.
  • 2010 – Google – Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
  • 2010 – Google – Dremel: Interactive Analysis of Web-Scale Datasets.
  • 2010 – Yahoo – S4: Distributed Stream Computing Platform.
  • 2009 – HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
  • 2008 – AMPLab – Chukwa: A large-scale monitoring system.
  • 2007 – Amazon – Dynamo: Amazon’s Highly Available Key-value Store.
  • 2006 – Google – The Chubby lock service for loosely-coupled distributed systems.
  • 2006 – Google – Bigtable: A Distributed Storage System for Structured Data.
  • 2004 – Google – MapReduce: Simplified Data Processing on Large Clusters.
  • 2003 – Google – The Google File System.