Distributed processing frameworks like Hadoop have opened a new path for processing big data. Below are the tools and technologies covered in this highly practical subject.
Hadoop
- Learned the distributed computing paradigm
- Frameworks and languages used: MrJob, Spark, Hive, HiveQL (a minimal MrJob sketch follows this list)
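To give a flavour of the MrJob work, here is a minimal word-count sketch. It is illustrative only, not taken from the coursework; the file name and class name are made up.

```python
# word_count.py -- illustrative MrJob sketch (not the actual coursework):
# counts word occurrences across all input lines.
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every whitespace-separated token.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the per-word counts produced by all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Run locally with `python word_count.py input.txt`, or on a cluster with mrjob's `-r emr` runner.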
Hive
- Learned advanced concepts, such as how the ORC columnar file format can be more space- and query-efficient than plain text formats in Hive
- Hive's integration with external tables, S3, etc.
- How partitioning makes query processing much faster by letting Hive scan only the relevant partitions
- Hive-specific concepts, such as creating the table schema explicitly before loading the data (a partitioned ORC sketch follows this list)
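To make these points concrete, here is a small PySpark sketch (Python is used for all examples here) that declares an explicit schema, stores the table as ORC on S3, and benefits from partition pruning. The table, columns, and bucket path are hypothetical.

```python
# Illustrative PySpark/HiveQL sketch; table, column, and S3 path
# names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The schema is declared explicitly before any data is loaded,
# and the table is stored in the columnar ORC format.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS employees (
        emp_id INT,
        name   STRING,
        salary DOUBLE
    )
    PARTITIONED BY (dept STRING)      -- one directory per department
    STORED AS ORC
    LOCATION 's3://my-bucket/warehouse/employees/'
""")

# A query that filters on the partition column only scans the
# matching partition's files (partition pruning).
spark.sql("SELECT name, salary FROM employees WHERE dept = 'engineering'").show()
```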
Various Databases
I learned about and implemented the databases listed below in this course; we also used them in a final group project (see Small Projects below). A short MongoDB sketch follows the list.
- Cassandra
- MongoDB
- DynamoDB
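As an example of the document-store side, a small pymongo sketch; the connection URI, database name, and documents are hypothetical.

```python
# Illustrative pymongo sketch; the connection URI, database, and
# documents are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["coursework"]
employees = db["employees"]

# Insert a document (schema-free, unlike the explicit Hive schema above).
employees.insert_one({"emp_id": 1, "name": "Alice", "dept": "engineering"})

# Query by field, returning matching documents as dicts.
for doc in employees.find({"dept": "engineering"}):
    print(doc["name"])
```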
We also learned about streaming frameworks such as Kafka; a short producer sketch follows.
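A minimal kafka-python producer sketch of the streaming idea; the broker address, topic name, and message are hypothetical.

```python
# Illustrative kafka-python producer sketch; broker address and
# topic name are hypothetical.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Publish a message to a topic; any consumers subscribed to the
# topic receive it as part of the stream.
producer.send("employee-events", b'{"emp_id": 1, "event": "hired"}')
producer.flush()  # block until the message is actually sent
```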
Small projects completed:
- Analyzed employee data, processing a large dataset efficiently through partitioning
- Used AWS EMR to work with various Hadoop-ecosystem technologies, including Spark and MrJob
- Finally, experimented with different NoSQL databases offered on the AWS cloud using the employee data, wrote a project report (a kind of case study) on this experimentation, and submitted it to the college (a DynamoDB sketch follows this list)
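A small boto3 sketch of the kind of DynamoDB experimentation described above. It assumes an existing table named Employees with emp_id as its partition key; the region, table, and item are all hypothetical.

```python
# Illustrative boto3 sketch; assumes a pre-created DynamoDB table
# "Employees" keyed on emp_id (hypothetical names throughout).
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Employees")

# Write an item keyed on emp_id.
table.put_item(Item={"emp_id": 1, "name": "Alice", "dept": "engineering"})

# Read it back by primary key.
resp = table.get_item(Key={"emp_id": 1})
print(resp["Item"])
```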