Distributed processing frameworks like Hadoop have opened a new path for processing big data. Below are the tools and technologies covered in this highly practical subject.
Hadoop
- Learned the distributed computing paradigm
- Frameworks and languages used: MrJob, Spark, Hive, HiveQL (a minimal MrJob sketch follows this list)
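To give a flavour of the MrJob work, here is a minimal word-count sketch. It is illustrative only, not taken from the coursework; the file name and class name are made up.

```python
# word_count.py -- illustrative MrJob sketch (not the actual coursework):
# counts word occurrences across all input lines.
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every whitespace-separated token.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the per-word counts produced by all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Run locally with `python word_count.py input.txt`, or on a cluster with mrjob's `-r emr` runner.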
Hive
- Learned advanced concepts, such as how the ORC columnar file format can be more space- and query-efficient than plain text formats in Hive
- Hive's integration with external tables, S3, etc.
- How partitioning makes query processing much faster by letting Hive scan only the relevant partitions
- Hive-specific concepts, such as creating the table schema explicitly before loading the data (a partitioned ORC sketch follows this list)
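To make these points concrete, here is a small PySpark sketch (Python is used for all examples here) that declares an explicit schema, stores the table as ORC on S3, and benefits from partition pruning. The table, columns, and bucket path are hypothetical.

```python
# Illustrative PySpark/HiveQL sketch; table, column, and S3 path
# names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The schema is declared explicitly before any data is loaded,
# and the table is stored in the columnar ORC format.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS employees (
        emp_id INT,
        name   STRING,
        salary DOUBLE
    )
    PARTITIONED BY (dept STRING)      -- one directory per department
    STORED AS ORC
    LOCATION 's3://my-bucket/warehouse/employees/'
""")

# A query that filters on the partition column only scans the
# matching partition's files (partition pruning).
spark.sql("SELECT name, salary FROM employees WHERE dept = 'engineering'").show()
```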
Various Databases
I learned about and implemented the databases listed below in this course; we also used them in a final group project (see Small Projects below). A short MongoDB sketch follows the list.
- Cassandra
- MongoDB
- DynamoDB
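As an example of the document-store side, a small pymongo sketch; the connection URI, database name, and documents are hypothetical.

```python
# Illustrative pymongo sketch; the connection URI, database, and
# documents are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["coursework"]
employees = db["employees"]

# Insert a document (schema-free, unlike the explicit Hive schema above).
employees.insert_one({"emp_id": 1, "name": "Alice", "dept": "engineering"})

# Query by field, returning matching documents as dicts.
for doc in employees.find({"dept": "engineering"}):
    print(doc["name"])
```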
We also learned about streaming frameworks such as Kafka; a short producer sketch follows.
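A minimal kafka-python producer sketch of the streaming idea; the broker address, topic name, and message are hypothetical.

```python
# Illustrative kafka-python producer sketch; broker address and
# topic name are hypothetical.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Publish a message to a topic; any consumers subscribed to the
# topic receive it as part of the stream.
producer.send("employee-events", b'{"emp_id": 1, "event": "hired"}')
producer.flush()  # block until the message is actually sent
```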
Small projects completed:
- Analyzed employee data, processing a large dataset efficiently through partitioning
- Used AWS EMR to work with various Hadoop-ecosystem technologies, including Spark and MrJob
- Finally, experimented with different NoSQL databases offered on the AWS cloud using the employee data, wrote a project report (a kind of case study) on this experimentation, and submitted it to the college (a DynamoDB sketch follows this list)
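A small boto3 sketch of the kind of DynamoDB experimentation described above. It assumes an existing table named Employees with emp_id as its partition key; the region, table, and item are all hypothetical.

```python
# Illustrative boto3 sketch; assumes a pre-created DynamoDB table
# "Employees" keyed on emp_id (hypothetical names throughout).
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Employees")

# Write an item keyed on emp_id.
table.put_item(Item={"emp_id": 1, "name": "Alice", "dept": "engineering"})

# Read it back by primary key.
resp = table.get_item(Key={"emp_id": 1})
print(resp["Item"])
```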