WHAT IS A DATA SCIENCE JOB?
Nowadays, data has become a raw material for businesses: vast amounts of structured and unstructured information are increasingly used to create new forms of economic value for organizations. As a result, organizations across industry, government and other sectors worldwide need employees with the right combination of technical and analytical data science knowledge and excellent communication skills. Organizations collect, organize and use very large amounts of data in their everyday operations, and a data scientist's job is to use that data to find patterns and solve problems: extracting, analysing and interpreting large amounts of data from a range of sources using statistical tools and techniques, artificial intelligence, machine learning, data mining and more. Recruiters therefore ask many kinds of data science questions at interview time, since the job spans areas such as health, banking, finance, consumer goods, retail, information technology, government, e-commerce, education and many more.
DATA SCIENCE JOB RESPONSIBILITIES:
* Work closely with teams across your organization to identify problems and use data to propose solutions for more effective decision making.
* Build algorithms and design experiments to merge, manage, interrogate and extract data, and supply visualizations and reports to HR, colleagues and clients.
* Help shape the organization's data science strategy.
* Establish new systems and processes, and look for opportunities to improve the flow of quality data.
* Evaluate new and emerging technologies in your industry and represent the company at external events and conferences.
* Build and maintain good relationships with clients.
Let's now go through some important questions and answers to prepare for your interview.
1. Q: What is data science?
A: Data science is a combination of algorithms, tools and machine learning techniques that helps you find hidden patterns in the given data.
2. Q: What is the difference between data science and big data?
A: Big data refers to large collections of data sets that can't be stored or handled in a traditional system; tools like Hadoop, Spark and NoSQL databases solve the problems of managing and processing such data. Data science sits on top of it, analysing the data to extract insights that result in informed decision making.
3. Q: How do you check data quality?
A: Data quality is checked along dimensions such as completeness, consistency, uniqueness, integrity, conformity and accuracy.
4. Q: A survey has some missing data; how would you deal with the missing values?
A: First inspect the data: search the list of values, filter questions, check for logical consistencies and assess the level of representativeness. Then apply an imputation technique such as random imputation, hot-deck imputation or imputation of the mean of subclasses.
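The mean-of-subclasses technique above can be sketched in plain Python. This is a toy illustration, not a production routine; the survey fields (age group, income) are made up for the example. Missing incomes are filled with the mean income of respondents in the same age group.

```python
from statistics import mean

# Hypothetical survey data: (age group, income), with None marking a missing value.
rows = [
    ("18-30", 1200), ("18-30", None), ("18-30", 1400),
    ("31-50", 2500), ("31-50", 2700), ("31-50", None),
]

def impute_subclass_mean(rows):
    """Replace each missing income with the mean income of the same age group."""
    observed = {}
    for group, income in rows:
        if income is not None:
            observed.setdefault(group, []).append(income)
    group_means = {g: mean(v) for g, v in observed.items()}
    return [(g, income if income is not None else group_means[g])
            for g, income in rows]

completed = impute_subclass_mean(rows)
print(completed)  # the 18-30 gap becomes 1300, the 31-50 gap becomes 2600
```

Hot-deck imputation would instead copy a randomly chosen observed value from the same subclass.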
5. Q: How would you deal with randomly missing values in a data set?
A: There are two forms of randomly missing values. MCAR (missing completely at random): the missingness is unrelated to any variable, so you can choose pairwise or listwise deletion of the missing-value cases. MAR (missing at random): the probability of a value being missing depends on other observed variables in the model, so deletion can bias the sample and data imputation is mainly used to replace the missing values.
6. Q: What is Hadoop and why should I care?
A: Hadoop is an open-source framework that manages data processing and storage for big data applications running on clusters of pooled machines. It is a collection of open-source utility software that makes it easy to use a network of many computers to solve problems involving large amounts of data and computation, providing a software framework for distributed storage and big data processing using the MapReduce programming model. Hadoop splits files into large blocks and distributes them across the nodes in a cluster, then ships code to those nodes so that the data is processed in parallel. This allows a data set to be processed faster and more efficiently than with a conventional supercomputing architecture.
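The MapReduce model mentioned above can be illustrated with a word count, the canonical Hadoop example, sketched here in plain Python (the function names and data are illustrative, not Hadoop's actual API): a map phase emits (word, 1) pairs, a shuffle groups them by key, and a reduce phase sums each group.

```python
from collections import defaultdict
from itertools import chain

lines = ["big data big cluster", "data node data"]  # toy input "files"

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values with a sum."""
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(chain.from_iterable(map_phase(l) for l in lines)))
print(counts)  # {'big': 2, 'data': 3, 'cluster': 1, 'node': 1}
```

On a real cluster the map and reduce phases run in parallel on the nodes that hold the data blocks.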
7. Q: What is fsck?
A: fsck is an abbreviation for "file system check". It is a command that searches for possible errors in files; in Hadoop, it generates a summary report that lists the overall health of the Hadoop Distributed File System.
8. Q: Which is better, good data or good models?
A: Many big companies prefer good data for a successful business, because a good model cannot be created without good data.
9. Q: What are recommender systems?
A: Recommender systems are a subclass of information filtering systems, used to predict how users would rate or score particular items such as movies, music or merchandise. They filter large volumes of information based on the data provided by a user and other factors, taking account of the user's preferences and interests. Recommender systems use algorithms that optimize the analysis of the data to build the recommendations; they achieve a high level of efficiency by associating elements of our consumption profiles, such as purchase history, content selection and even hours of activity, to make accurate recommendations.
10. Q: What are the different types of recommender systems?
A: There are three types of recommender systems:
* Collaborative filtering- a method of making automatic predictions by using the recommendations of other people. There are two types of collaborative filtering techniques-
. User-user collaborative filtering
. Item-item collaborative filtering
* Content-based filtering- based on the description of an item and a user's choices.
* Hybrid recommendation systems- combinations of diverse rating and sorting algorithms. A hybrid recommendation engine can recommend a wide range of products to consumers with precision, based on their history and preferences.
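Item-item collaborative filtering from the list above can be sketched in a few lines of plain Python (toy ratings, illustrative names): two items are considered similar when the same users rate them alike, measured here with cosine similarity over zero-filled rating vectors.

```python
from math import sqrt

ratings = {          # user -> {item: rating}, made-up data
    "u1": {"movie_a": 5, "movie_b": 4, "movie_c": 1},
    "u2": {"movie_a": 4, "movie_b": 5},
    "u3": {"movie_a": 1, "movie_c": 5},
}

def item_vector(item):
    """Ratings for one item across all users (0 when a user has not rated it)."""
    return [ratings[u].get(item, 0) for u in sorted(ratings)]

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

sim = cosine(item_vector("movie_a"), item_vector("movie_b"))
print(round(sim, 3))  # high: users who liked movie_a also liked movie_b
```

A real engine would also mean-center ratings and handle sparsity, but the core idea is this similarity score.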
11. Q: What is the difference between supervised and unsupervised learning?
A: Supervised learning
. Uses known and labeled data as input
. Has a feedback mechanism
. The most commonly used algorithms are decision trees, logistic regression and support vector machines
Unsupervised learning
. Uses unlabeled data as input
. Has no feedback mechanism
. The most commonly used algorithms are k-means clustering, hierarchical clustering and apriori
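As a concrete unsupervised example, here is a bare-bones k-means sketch in plain Python on one-dimensional data (illustrative, not a production implementation): each point is assigned to its nearest centroid, then each centroid moves to the mean of its assigned points, and the two steps repeat.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Tiny 1-D k-means: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = {i: [] for i in range(k)}
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in clusters.items()]
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(kmeans_1d(points, k=2))  # two centroids, near 1.0 and 9.0
```

Note that no labels were supplied anywhere: the groups emerge from the data alone, which is what makes this unsupervised.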
12. Q: What is logistic regression in data science?
A: Logistic regression, also called the logit model, is a method to forecast a binary outcome from a linear combination of predictor variables.
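A logit-model fit can be sketched from scratch in plain Python (toy data; a real project would use a library such as scikit-learn): stochastic gradient descent on the log-loss learns the weights of the sigmoid of a linear combination.

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit one weight and a bias by per-sample gradient descent on log-loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x   # d(log-loss)/dw for this sample
            b -= lr * (p - y)       # d(log-loss)/db for this sample
    return w, b

# Made-up data: hours studied vs. pass (1) / fail (0).
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 1 + b), sigmoid(w * 6 + b))  # low for x=1, high for x=6
```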
13. Q: Name three types of bias that can occur during sampling.
A: Three types of bias that occur in the sampling process are:
* Selection bias
* Under coverage bias
* Survivorship bias
14. Q: What are prior probability and likelihood?
A: The prior probability is the proportion of the dependent variable in the data set, while the likelihood is the probability of classifying a given observation in the presence of some other variable.
15. Q: Explain recommender systems.
A: A recommender system is a subclass of information filtering techniques. It helps you predict the rating a user is likely to give to a product.
16. Q: Name three disadvantages of using a linear model.
A: * You can't use this model for binary or count outcomes.
* There are plenty of overfitting problems that it can't solve easily.
* It assumes a linear relationship between the variables, so it can't capture non-linear patterns.
17. Q: Why do you need to perform resampling of data?
A: * To estimate the accuracy of sample statistics by drawing randomly with replacement from a set of data points, or using them as a substitute for inaccessible data.
* To substitute labels on data points when performing significance tests, making use of a nested resampling process.
* To validate models by using random subsets.
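The first bullet, drawing randomly with replacement, is the bootstrap. A minimal plain-Python sketch (made-up sample values): resample the data many times and look at how the sample mean varies.

```python
import random
from statistics import mean

def bootstrap_means(sample, n_resamples=1000, seed=42):
    """Draw n_resamples resamples with replacement; return each one's mean."""
    rng = random.Random(seed)
    return [mean(rng.choices(sample, k=len(sample)))
            for _ in range(n_resamples)]

sample = [4, 8, 6, 5, 3, 7, 9, 5]
means = bootstrap_means(sample)
print(min(means), mean(means), max(means))  # spread around the sample mean
```

The spread of these bootstrap means estimates the sampling variability of the statistic without collecting any new data.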
18. Q: What is linear regression?
A: Linear regression is a statistical method in which the score of a variable 'A' is predicted from the score of a second variable 'B'.
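The textbook least-squares formulas behind this can be shown in plain Python (toy scores where A is exactly twice B): the slope is cov(B, A) / var(B) and the intercept is mean(A) minus slope times mean(B).

```python
from statistics import mean

def fit_line(b_scores, a_scores):
    """Ordinary least squares for one predictor B and one outcome A."""
    mb, ma = mean(b_scores), mean(a_scores)
    slope = (sum((b - mb) * (a - ma) for b, a in zip(b_scores, a_scores))
             / sum((b - mb) ** 2 for b in b_scores))
    return slope, ma - slope * mb

b_scores = [1, 2, 3, 4, 5]
a_scores = [2, 4, 6, 8, 10]          # exactly A = 2 * B
slope, intercept = fit_line(b_scores, a_scores)
print(slope, intercept)  # 2.0 0.0
```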
19. Q: Explain p-value.
A: When you conduct a hypothesis test in statistics, a p-value lets you determine the strength of your results. It is a number between 0 and 1; a small value (typically below 0.05) indicates that the observed result would be unlikely if the null hypothesis were true.
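One way to make the idea concrete is a permutation test, chosen here purely for illustration (made-up group values): the p-value is the fraction of random label shufflings whose group difference is at least as extreme as the one observed.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=2000, seed=1):
    """Share of label shufflings with a group-mean gap >= the observed gap."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = group_a + group_b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            hits += 1
    return hits / n_perm

p = permutation_p_value([12, 14, 13, 15, 16], [22, 24, 23, 25, 21])
print(p)  # well below 0.05: the gap is very unlikely under the null
```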
20. Q: Define the term deep learning.
A: Deep learning is a subset of machine learning. It is concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks (ANNs).
21. Q: What is normal distribution?
A: A normal distribution is a continuous probability distribution whose values spread in the shape of a bell curve around the mean. It is useful in statistics for analysing variables and their relationships.
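The bell curve's well-known 68-95-99.7 rule can be checked directly with the Python standard library's NormalDist: it gives the probability mass lying within 1, 2 and 3 standard deviations of the mean.

```python
from statistics import NormalDist

d = NormalDist(mu=0, sigma=1)          # the standard normal distribution

def within(k):
    """Probability mass within k standard deviations of the mean."""
    return d.cdf(k) - d.cdf(-k)

print(round(within(1), 4), round(within(2), 4), round(within(3), 4))
# 0.6827 0.9545 0.9973
```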
22. Q: Name various types of deep learning frameworks.
A: Deep learning frameworks are mainly used to design, train, validate and deploy models through high-level programming interfaces. Popular ones include:
* Pytorch
* Microsoft Cognitive Toolkit
* TensorFlow
* Caffe
* Chainer
* Keras
23. Q: What is precision?
A: Precision is one of the most commonly used error metrics in classification. Its range is from 0 to 1, where 1 represents 100%.
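Precision has a concrete formula: true positives divided by everything the classifier flagged positive. A plain-Python sketch with made-up labels:

```python
def precision(y_true, y_pred):
    """TP / (TP + FP): of all positive predictions, how many were right."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 1]
print(precision(y_true, y_pred))  # 3 of 4 positive predictions correct: 0.75
```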
24. Q: What is an API? What are APIs used for?
A: API stands for Application Programming Interface. It is a set of routines, protocols and tools for building software applications and exchanging data between them.
25. Q: What is cluster sampling?
A: Cluster sampling is a probability sampling method where the researcher divides the population into separate smaller natural groups, called clusters. A simple random sample of clusters is then selected, and the researcher conducts the analysis on the data from the sampled clusters.
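A cluster-sampling sketch in plain Python (the population and school names are made up for the example): whole clusters are drawn at random, and every member of a chosen cluster enters the sample.

```python
import random

# Hypothetical population split into natural groups (schools).
population = {
    "school_a": ["a1", "a2", "a3"],
    "school_b": ["b1", "b2"],
    "school_c": ["c1", "c2", "c3", "c4"],
    "school_d": ["d1", "d2"],
}

def cluster_sample(clusters, n_clusters, seed=7):
    """Pick n_clusters whole clusters at random; keep all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

sample = cluster_sample(population, n_clusters=2)
print(sample)  # every member of the two selected schools
```

Contrast with simple random sampling, which would draw individuals from the whole population rather than intact groups.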
CONTINUED......