Department of Computer Science and Engineering

B.Tech. III (CO) Semester - 6        L  T  P  C
CO314 : DATA SCIENCE (EIS-II)        3  0  0  3

COURSE OBJECTIVES
  • To teach the fundamentals of data analytics and the data science pipeline.
  • To enable students to apply statistical methods, regression techniques, and machine learning algorithms to make sense of data sets both large and small.
  • To explain various data visualization techniques and their applications.
COURSE OUTCOMES
    After successful completion of this course, the student will be able to:
    • Analyze large-scale data using Hadoop and machine learning.
    • Work with Hadoop mappers and reducers to analyze data.
    • Gain insight into data visualization and optimization techniques.
COURSE CONTENT
    INTRODUCTION TO PARADIGMS FOR DATA MANIPULATION, LARGE SCALE DATA SETS

    (14 Hours)

    MapReduce (Hadoop) and software interfaces (e.g., Hive, Pig): moving from traditional warehouses to MapReduce. Distributed databases and distributed hash tables, near-real-time query.

    LARGE-SCALE ITERATIVE ALGORITHMS

    (16 Hours)

    ML at large scale (distributed supervised and unsupervised learning).

    Feature hashing

    Topic models (LDA)

    Large scale SVD and NMF for spectral clustering

    Inverted-index and LSH based clustering

    Large scale k-means clustering

    VISUALIZATION

    (08 Hours)

    Graph visualization

    Data summaries

    Hypothesis testing, ML model-checking and comparison

    ADVANCED TOPICS

    (04 Hours)

    (Total Contact Time: 42 Hours)
BOOKS RECOMMENDED
    1. Bekkerman et al., "Scaling Up Machine Learning".
    2. Tom White, "Hadoop: The Definitive Guide", Third Edition, O'Reilly Media, 2012.
    3. Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Datasets", Cambridge University Press, 2012.
    4. Vincent Granville, "Developing Analytic Talent: Becoming a Data Scientist", Wiley, 2014.
    5. Jeffrey Stanton & Robert De Graaf, "Introduction to Data Science", Version 2.0, 2013.