Analytics Camp 2012 Big Data Intro and Roundtable
Tim Ross is giving an intro on big data systems such as Hadoop, plus NoSQL systems.
Hadoop was inspired by the distributed data systems Google built and described in its published papers. It was not developed at Google itself; much of its early development took place at Yahoo!.
Apache Hadoop is a distributed framework for data processing and storage.
Google’s MapReduce paper describes the process. Tim explained how the map-reduce framework works: the data is distributed across a number of lower-powered commodity servers, the map step processes the data on each server, and the reduce step collapses the results onto another server, which then sends them to the consumer of the data. Splitting up the data allows much faster, concurrent data access.
Hadoop is written in Java and is open source. Many use cases are batch oriented.
HBase is a NoSQL database that is gaining traction. It is modeled on Google’s Bigtable, is partition tolerant, and can be used with Hadoop.
LexisNexis open-sourced HPCC, which is marketed as a Hadoop killer.
Conrad evaluated Pentaho, a set of analysis and reporting tools. It is a frontend tools resource that focuses on analysis. Its strength is building the cubes for processing. It uses the MDX language (sort of like SQL) for processing.
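To make the split-then-collapse idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code: the threads only stand in for separate servers, and the names split, process_chunk and combine, as well as the sample readings, are made up for illustration. Each "server" computes a partial sum and count for its share of the data, and a collector merges the partial results into one mean.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n_parts):
    """Deal records round-robin into n_parts chunks (stand-ins for servers)."""
    return [records[i::n_parts] for i in range(n_parts)]


def process_chunk(chunk):
    """Work done locally on one 'server': a partial sum and a count."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Collapse the partial results from every chunk into one overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def run(records, n_parts=3):
    chunks = split(records, n_parts)
    # Each chunk is processed concurrently; threads play the role of servers.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return combine(partials)


readings = list(range(1, 11))  # hypothetical sample data: 1..10
mean = run(readings)
print(mean)
```

The point of the design is that only the small partial results travel to the collector, not the raw data, which is what lets the work spread across many inexpensive machines.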
Business Intelligence was mentioned.
Tim used a presentation he did in the past on Hadoop. When you write a Hadoop job, you write a mapper and a reducer. He used Hadoop on a genomics project.
iRODS was mentioned.
Tim showed the Yahoo! Developer Network site “Module 2: The Hadoop Distributed File System”.
Column-oriented RDBMS systems were mentioned.
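As a rough picture of what writing a mapper and a reducer looks like, here is a small Python sketch that runs in memory rather than on Hadoop. It is not the code from Tim's presentation. The function names mapper, reducer and run_job and the short sequence tokens are invented for illustration, and the sort-and-group step only imitates the shuffle that the framework performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit a (key, 1) pair for every whitespace-separated token in a line."""
    for token in line.split():
        yield token, 1


def reducer(key, values):
    """Collapse all values seen for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    # Map phase: every input line is turned into key/value pairs.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Sorting by key stands in for the framework's shuffle/sort step.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: the reducer sees each key once, with all of its values.
    return [
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


# Hypothetical input: made-up sequence tokens, not real project data.
output = run_job(["ACGT ACGT TTGA", "TTGA ACGT"])
print(output)
```

The author of a job only supplies the two functions; splitting the input, moving pairs between machines and grouping them by key is the framework's responsibility.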