Michael Nguyen

Data Scientist

Data Science professional with a passion for creative problem solving through data analysis. Quick learner with extensive analytical skills, strong attention to detail, and the ability to work well in team environments. Highly accurate and adept at collecting, analyzing, and interpreting large datasets, developing new forecasting models, and performing data management tasks for data-driven decisions and products.

Currently seeking internships and full-time opportunities. Graduated with a Master of Engineering degree in Data Science from the University of Toronto.

View My LinkedIn Profile

K-means Clustering on Hadoop MapReduce

GitHub Repository Link: https://github.com/mikonguyen/K-means-on-MapReduce

Overview

The goal of this project was to implement the K-means clustering algorithm from scratch on Hadoop MapReduce for big data analytics. The program was developed in Java.

The K-means algorithm is one of the most well-known and commonly used clustering methods.

It takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high whereas the inter-cluster similarity is low.

Cluster similarity is measured according to the mean value of the objects in the cluster, which can be regarded as the cluster’s ‘center of gravity’.
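
In symbols, the cluster center and the quantity K-means tries to minimize can be written as below. The notation is introduced here for illustration only: C_j denotes the set of objects assigned to cluster j, mu_j its mean, and squared Euclidean distance is an assumption, since the text above does not name a specific distance measure.

```latex
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x,
\qquad
J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
```

A small J means the objects sit close to their own cluster's mean, which corresponds to high intra-cluster similarity.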

The algorithm proceeds as follows:
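
As a rough illustration, the standard K-means iteration (assign each object to its nearest centroid, then recompute each centroid as the mean of its assigned objects, and repeat until nothing changes) can be sketched in plain Java. This is a minimal single-machine sketch and not the code from the repository: the class name `KMeansSketch`, its method names, the sample points, and the choice of squared Euclidean distance are all assumptions made for the example. In a MapReduce setting, the assignment step would typically be done in the mapper and the mean computation in the reducer, but that mapping is a common design and is not confirmed by the text above.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the original project.
final class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension (assumed metric).
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the index of the centroid closest to the given point.
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = squaredDistance(point, centroids[c]);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Repeats the assignment and update steps until the centroids stop
    // changing or maxIterations is reached. The input arrays are not modified.
    static double[][] cluster(double[][] points, double[][] initialCentroids, int maxIterations) {
        int k = initialCentroids.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone();
        }

        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];

            // Assignment step: add each point to the running sum of its nearest cluster.
            for (double[] p : points) {
                int c = nearestCentroid(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }

            // Update step: move each centroid to the mean of its assigned points.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    double mean = sums[c][d] / counts[c];
                    if (mean != centroids[c][d]) {
                        changed = true;
                    }
                    centroids[c][d] = mean;
                }
            }
            if (!changed) {
                break; // converged: no centroid moved in this iteration
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Made-up sample data: two visually separate groups of 2-D points.
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] initial = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(cluster(points, initial, 20)));
    }
}
```

Accumulating per-cluster sums and counts, rather than storing every assignment, keeps the update step cheap. It is also the shape of data a reducer would naturally receive in a distributed version.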

Tools Utilised