Hadoop Interview Questions and Answers-1

1. What are real-time industry applications of Hadoop?
    Hadoop, formally Apache Hadoop, is an open-source software platform for scalable, distributed processing of large volumes of data. It provides fast, cost-effective analysis of structured and unstructured data generated on digital platforms and within the enterprise, and it is used across many industries.
    Some of the instances where Hadoop is used:
    • Managing traffic on streets
    • Stream processing
    • Content Management and Archiving Emails
    • Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster
    • Fraud detection and Prevention
    • Ad-targeting platforms use Hadoop to capture and analyze clickstream, transaction, video and social media data
    • Managing content, posts, images and videos on social media platforms
    • Analyzing customer data in real-time for improving business performance
    • Public sector fields such as intelligence, defense, cyber security and scientific research
    • Financial agencies are using Big Data Hadoop to reduce risk, analyze fraud patterns, identify rogue traders, more precisely target their marketing campaigns based on customer segmentation, and improve customer satisfaction
    • Getting access to unstructured data like output from medical devices, doctor’s notes, lab results, imaging reports, medical correspondence, clinical data, and financial data.

2. How is Hadoop different from other parallel computing systems?
Hadoop includes a distributed file system (HDFS), which lets you store and handle massive amounts of data on a cluster of machines while handling data redundancy. The primary benefit is that, since data is stored across several nodes, it can be processed in a distributed manner: each node processes the data stored on it instead of spending time moving it over the network.
In contrast, a relational database system lets you query data in real time, but storing data in tables, records and columns becomes inefficient when the data is huge.
Hadoop also provides a way to build a column-oriented database with HBase, for run-time queries on rows.

3. In which modes can Hadoop run?
    Hadoop can run in three modes:
    1. Standalone Mode: The default mode of Hadoop. It uses the local file system for input and output operations. This mode is mainly used for debugging, and it does not use HDFS. No custom configuration is required in mapred-site.xml, core-site.xml or hdfs-site.xml. It is much faster than the other modes.
    2. Pseudo-Distributed Mode (Single-Node Cluster): You need to configure all three files mentioned above. All daemons run on one node, so the Master and Slave node are the same machine.
    3. Fully Distributed Mode (Multi-Node Cluster): This is the production mode of Hadoop (what Hadoop is known for), where data is distributed across several nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slave.

4. Explain the major difference between HDFS block and InputSplit.
    In simple terms, a block is the physical representation of data, while a split is the logical representation of the data present in the blocks. The split acts as an intermediary between block and mapper. Suppose we have two blocks:
    Block 1: ii nntteell
    Block 2: Ii ppaatt
    Now consider the map task: it will read the first block from ii to ll, but it does not know how to process the second block at the same time. This is where the split comes into play: it forms a logical group of Block 1 and Block 2 as a single unit. It then forms key-value pairs using the InputFormat and RecordReader and sends them to the map for further processing. With InputSplit, if you have limited resources, you can increase the split size to limit the number of maps. For instance, if a 640 MB file is stored as 10 blocks of 64 MB each and resources are limited, you can set the split size to 128 MB. Each split then groups 128 MB of data, so only 5 map tasks are created instead of 10.
    However, if the input is marked as non-splittable (for example, the input format's isSplitable() returns false), the whole file forms one InputSplit and is processed by a single map, which takes more time when the file is large.
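
The 640 MB arithmetic above can be checked with a small Python sketch. It is not Hadoop code: the function names are made up for illustration, and it assumes the split size is the block size clamped between a configured minimum and maximum, ignoring the small slack Hadoop allows on the last split.

```python
import math

MB = 1024 * 1024

def compute_split_size(block_size, min_size, max_size):
    # Assumed rule: clamp the block size between the configured min and max split size.
    return max(min_size, min(max_size, block_size))

def count_splits(file_size, split_size):
    # Simplification: every split is full-sized except possibly the last one.
    return math.ceil(file_size / split_size)

file_size = 640 * MB
block_size = 64 * MB
no_limit = 2**63 - 1

# With default limits, the split size equals the block size.
default_split = compute_split_size(block_size, min_size=1, max_size=no_limit)
# Raising the minimum split size to 128 MB halves the number of map tasks.
larger_split = compute_split_size(block_size, min_size=128 * MB, max_size=no_limit)

print(count_splits(file_size, default_split))  # 10
print(count_splits(file_size, larger_split))   # 5
```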

5. What is distributed cache and what are its benefits?
    Distributed Cache, in Hadoop, is a service of the MapReduce framework for caching files that a job needs. Once a file is cached for a specific job, Hadoop makes it available on the local file system of each node where map and reduce tasks are executing. You can then read the cached file in your code and populate any collection (such as an array or hashmap) from it.
Benefits of using distributed cache are:
• It distributes simple, read-only text/data files as well as more complex types such as jars and archives. Archives are un-archived at the slave node.
• Distributed cache tracks the modification timestamps of cache files, which signals that the files should not be modified while a job is executing.

6. Explain the difference between NameNode, Checkpoint NameNode and BackupNode.
    NameNode is the core of HDFS. It manages the metadata: which file maps to which block locations, and which blocks are stored on which DataNode. In simple terms, it holds the data about the data being stored. NameNode maintains a directory-tree structure of all the files present in HDFS on a Hadoop cluster. It uses the following files for the namespace:
    fsimage file: keeps track of the latest checkpoint of the namespace.
    edits file: a log of the changes made to the namespace since the last checkpoint.
    Checkpoint NameNode has the same directory structure as NameNode. It creates checkpoints of the namespace at regular intervals by downloading the fsimage and edits files and merging them in its local directory. The new image produced by the merge is then uploaded to the NameNode.
    The Secondary NameNode is an older daemon that performs a similar periodic merge of fsimage and edits.
    Backup Node provides the same checkpointing functionality as the Checkpoint node, but stays synchronized with the NameNode. It maintains an up-to-date in-memory copy of the file system namespace, so it does not need to download the fsimage and edits files at regular intervals. To create a new checkpoint, the Backup Node only needs to save its current in-memory state to an image file.
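
The checkpoint merge described above can be pictured with a toy Python model. It is only a sketch: the real fsimage and edits files are binary formats, and the operation names ("create", "delete") and paths here are hypothetical.

```python
def apply_edits(fsimage, edits):
    """Replay the edit log on top of the last checkpoint and return the new image."""
    namespace = dict(fsimage)  # copy, so the previous checkpoint is left untouched
    for op, path, blocks in edits:
        if op == "create":
            namespace[path] = blocks
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

# Last checkpoint: one file made of two blocks.
fsimage = {"/logs/a.txt": ["blk_1", "blk_2"]}

# Changes logged since that checkpoint.
edits = [
    ("create", "/logs/b.txt", ["blk_3"]),
    ("delete", "/logs/a.txt", None),
]

new_fsimage = apply_edits(fsimage, edits)
print(new_fsimage)  # {'/logs/b.txt': ['blk_3']}
```

After the merge, the edit log can start empty again, which is why periodic checkpoints keep NameNode restarts short.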

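7. What are the most common Input Formats in Hadoop?
    The three most common input formats in Hadoop are:
    • Text Input Format: the default input format in Hadoop. Each line is a record; the key is the line's byte offset and the value is the line's content.
    • Key Value Input Format: used for plain text files where each line is split into a key and a value by a separator (a tab by default).
    • Sequence File Input Format: used for reading SequenceFiles, Hadoop's binary key/value file format.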
8. Define DataNode and how does NameNode tackle DataNode failures?
    DataNode stores data in HDFS; it is a node where the actual data resides in the file system. Each DataNode sends a heartbeat message to the NameNode to signal that it is alive. If the NameNode does not receive a heartbeat from a DataNode for about 10 minutes, it considers that DataNode dead and starts re-replicating the blocks that were hosted on it to other DataNodes. A BlockReport contains the list of all blocks on a DataNode, which tells the NameNode which blocks are affected.
    The NameNode manages the replication of data blocks from one DataNode to another. In this process, the replicated data is transferred directly between DataNodes and never passes through the NameNode.
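
The "about 10 minutes" figure can be derived with a short Python sketch. It assumes the commonly documented defaults of a 3-second heartbeat interval (dfs.heartbeat.interval) and a 5-minute recheck interval (dfs.namenode.heartbeat.recheck-interval); the function names are made up for illustration, and your cluster's values may differ.

```python
def dead_node_timeout_seconds(heartbeat_interval_s=3, recheck_interval_s=300):
    # Assumed rule: two recheck intervals plus ten missed heartbeats.
    return 2 * recheck_interval_s + 10 * heartbeat_interval_s

def is_dead(last_heartbeat_s, now_s, timeout_s):
    # A DataNode is treated as dead once it has been silent longer than the timeout.
    return now_s - last_heartbeat_s > timeout_s

timeout = dead_node_timeout_seconds()
print(timeout)                       # 630 seconds, i.e. 10.5 minutes
print(is_dead(0, 600, timeout))      # False: silent for 10 minutes, not yet dead
print(is_dead(0, 700, timeout))      # True: silent for longer than the timeout
```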
9. What are the core methods of a Reducer?
    The three core methods of a Reducer are:
    1. setup(): called once at the start of the task; used to configure parameters such as input data size and distributed cache.
    protected void setup(Context context)
    2. reduce(): the heart of the reducer; called once per key with all the values associated with that key.
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
    3. cleanup(): called only once at the end of the task, to clean up temporary files.
    protected void cleanup(Context context)
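
The call order of these three methods can be illustrated with a plain Python simulation. This is not the Hadoop API: the class and the driver function below are hypothetical stand-ins that only mimic the lifecycle (setup once, reduce once per key, cleanup once).

```python
from itertools import groupby

class SumReducer:
    """Toy reducer that sums the values for each key and records its own call order."""

    def setup(self):
        self.calls = ["setup"]
        self.totals = {}

    def reduce(self, key, values):
        self.calls.append(f"reduce:{key}")
        self.totals[key] = sum(values)

    def cleanup(self):
        self.calls.append("cleanup")

def run_reducer(reducer, sorted_pairs):
    # The framework hands the reducer its input already sorted and grouped by key.
    reducer.setup()
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        reducer.reduce(key, (value for _, value in group))
    reducer.cleanup()
    return reducer

reducer = run_reducer(SumReducer(), [("a", 1), ("a", 1), ("b", 1)])
print(reducer.calls)   # ['setup', 'reduce:a', 'reduce:b', 'cleanup']
print(reducer.totals)  # {'a': 2, 'b': 1}
```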
10. What is SequenceFile in Hadoop?
    Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are:
    1. Uncompressed key/value records.
    2. Record compressed key/value records – only ‘values’ are compressed here.
    3. Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.
11. What is Job Tracker role in Hadoop?
    Job Tracker's primary functions are resource management (managing the Task Trackers), tracking resource availability, and task life cycle management (tracking task progress and fault tolerance). Job Tracker belongs to MapReduce v1; under YARN its duties are split between the ResourceManager and per-job ApplicationMasters.
    • It is a process that usually runs on a separate node, not on a DataNode
    • Job Tracker communicates with the NameNode to identify data location
    • Finds the best Task Tracker nodes to execute tasks, preferring nodes close to the data
    • Monitors individual Task Trackers and reports the overall job status back to the client.
    • It tracks the execution of MapReduce workloads running on the slave nodes.
12. What is the use of RecordReader in Hadoop?
    An InputSplit only describes a byte-oriented slice of the input, so a RecordReader is used to turn that slice into complete records (key-value pairs) for the mapper, including a record that continues past the end of the split. For instance, if one record is divided across a boundary like:
    Row1: Welcome to
    Row2: Intellipaat
    It will be read as “Welcome to Intellipaat” using RecordReader.
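
How a line-based reader can reassemble a record that straddles a split boundary is sketched below in Python. It is a simplified model, not Hadoop's implementation: it assumes newline-delimited records and follows the usual convention that every split except the first discards its leading partial line, while each split reads on past its end to finish its last line. The function name and sample data are made up for illustration.

```python
def read_split(data: bytes, start: int, length: int) -> list[str]:
    """Return the complete lines that belong to the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # The previous split finishes this line, so skip to the next line start.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # "<=" lets this split also take a line that begins exactly at its end,
    # because the following split always discards its first line.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append(data[pos:line_end].decode())
        pos = line_end + 1
    return records

data = b"Welcome to Intellipaat\nHadoop interview\nQuestions\n"

# Fixed 20-byte splits cut the first line in the middle of "Intellipaat".
splits = [read_split(data, offset, 20) for offset in range(0, len(data), 20)]
print(splits[0])  # ['Welcome to Intellipaat']
```

Across all splits, each line is returned exactly once, even though the byte boundaries fall in the middle of records.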
13. What is Speculative Execution in Hadoop?
    One limitation of Hadoop is that, because tasks are distributed across several nodes, a few slow nodes can hold back the rest of the job. There are various reasons for tasks to be slow, and they are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches an equivalent task as a backup. This backup mechanism in Hadoop is Speculative Execution.
    It creates a duplicate task on another node, so the same input can be processed multiple times in parallel. When most tasks in a job have completed, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks on nodes that are currently free. When a task finishes, the JobTracker is notified. If other copies of that task are still executing speculatively, Hadoop tells the TaskTrackers to kill those copies and discard their output.
    Speculative execution is enabled by default in Hadoop. To disable it, set the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false (in newer releases these properties are named mapreduce.map.speculative and mapreduce.reduce.speculative).
14. What happens if you try to run a Hadoop job with an output directory that is already present?
    It will throw an exception saying that the output directory already exists. To run the MapReduce job, you need to ensure that the output directory does not already exist in HDFS. To delete the directory before running the job, you can use the shell:
    hadoop fs -rm -r /path/to/your/output/
    Or via the Java API:
    FileSystem.get(conf).delete(outputDir, true);

15. How can you debug Hadoop code?
    First, check the list of MapReduce jobs currently running. Next, check whether any orphaned jobs are running; if there are, you need to determine the location of the ResourceManager (RM) logs.
    1. Run: "ps -ef | grep -i ResourceManager"
    and look for the log directory in the displayed result. Find the job ID in the displayed list and check whether there is any error message associated with that job.
    2. On the basis of the RM logs, identify the worker node that was involved in the execution of the task.
    3. Now, log in to that node and run: "ps -ef | grep -i NodeManager"
    4. Examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job.
16. How to configure Replication Factor in HDFS?
    hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all files placed in HDFS.
    You can also modify the replication factor on a per-file basis using the Hadoop FS shell:
    [training@localhost ~]$ hadoop fs -setrep -w 3 /my/file
    Similarly, you can change the replication factor of all the files under a directory:
    [training@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir

17. How to compress mapper output but not the reducer output?
    To achieve this compression, you should set:
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);
18. What is the difference between Map Side join and Reduce Side Join?
    A Map Side Join is performed before the data reaches the map function. It requires strictly structured input: the datasets must already be sorted and partitioned in the same way. A Reduce Side Join (Repartitioned Join), on the other hand, is simpler than a map side join, since the input datasets need not be structured. However, it is less efficient, because the data has to go through the sort and shuffle phases, which adds network overhead.
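
The reduce side join can be sketched in plain Python. This is a single-process illustration of the idea, not MapReduce code: the mapper tags each record with its source, a dictionary stands in for shuffle and sort, and the reducer pairs up records that share a key. The function names, the "L"/"R" tags and the sample data are all made up for illustration.

```python
from collections import defaultdict

def map_phase(records, tag):
    # Tag every record with the dataset it came from, keyed by the join key.
    for key, value in records:
        yield key, (tag, value)

def reduce_side_join(left, right):
    # Grouping by key stands in for the shuffle and sort phases.
    groups = defaultdict(list)
    for key, tagged in list(map_phase(left, "L")) + list(map_phase(right, "R")):
        groups[key].append(tagged)

    joined = []
    for key in sorted(groups):
        lefts = [value for tag, value in groups[key] if tag == "L"]
        rights = [value for tag, value in groups[key] if tag == "R"]
        # Inner join: emit one row per matching left/right pair.
        for left_value in lefts:
            for right_value in rights:
                joined.append((key, left_value, right_value))
    return joined

employees = [(1, "Asha"), (2, "Ravi")]
departments = [(1, "Engineering"), (3, "Sales")]

result = reduce_side_join(employees, departments)
print(result)  # [(1, 'Asha', 'Engineering')]
```

Every record from both inputs passes through the grouping step, which is the shuffle cost that makes this join slower than a map side join.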
19. How can you transfer data from Hive to HDFS?

By writing the query: hive> insert overwrite directory '/' select * from emp;
You can write a query for whatever data you want to export from Hive to HDFS. The output will be stored in part files in the specified HDFS path.

20. Which companies use Hadoop?
Yahoo! (the biggest contributor to the creation of Hadoop), whose search engine uses Hadoop; Facebook, which developed Hive for analysis; Amazon; Netflix; Adobe; eBay; Spotify; and Twitter.
