Big data has arrived. At BIDMC, I oversee 1.5 petabytes of clinical and administrative data. At HMS, I oversee nearly 3 petabytes of research data.
As BlackBerry's recent outage illustrates, depending on a single monolithic infrastructure has its risks, and the impact of a failure can be enormous.
How can we leverage commodity hardware infrastructure, reduce risk, and meet user demands for mining big data? Apache Hadoop is a cool technology worth knowing about.
Hadoop is an open source framework for the distributed processing of large data sets across clusters of computers, designed to scale from a single server to thousands of machines. Rather than relying on hardware to deliver high availability, Hadoop detects failures and automatically falls back on redundant copies of the data. The Hadoop library includes:
* The Hadoop Distributed File System (HDFS), which splits user data across the servers in a cluster.
* MapReduce, a parallel distributed processing system that takes advantage of the distribution and replication of data in HDFS to spread execution of a job across many nodes in the cluster (a brief example follows below).
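To make the MapReduce idea concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API, assuming a recent Hadoop release. It is illustrative rather than production code; the input and output paths are placeholders passed on the command line, and the class names TokenizerMapper and IntSumReducer are my own.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper reads a block of the input (typically local to its node
  // in HDFS) and emits a (word, 1) pair for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: the framework groups all counts for a given word and the
  // reducer sums them into a single total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combine counts locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles the hard parts: it ships the map tasks to the nodes holding the data, shuffles the intermediate (word, count) pairs to the reducers, and, if a node fails mid-job, reschedules its tasks on another node that holds a replica of the same HDFS block.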
Microsoft has just introduced support for Hadoop into SQL Server 2012 as part of its end-to-end Big Data roadmap.
A fault-tolerant distributed file system for big data that runs on commodity hardware and is even integrated into mainstream data mining tools like SQL Server. That's cool!