User:LI AR/Books/Cracking the DataScience Interview


Cracking The Data Science Interview Basic Stuff To Know



Generic pages: Glossaire_de_l'exploration_de_données; Big_data

Inspired by books such as:
- "A collection of Data Science Interview Questions Solved in Python and Spark vol I & II"
- "120 real data science interview questions"

Tips / Known Limits of DS

DataScience is (very) experimental (Andrew Ng): https://pbs.twimg.com/media/CBXshmjWgAAgLKa.jpg

Bias–variance_tradeoff / http://www.ritchieng.com/machinelearning-learning-curve/

Survivorship_bias

Correlation_does_not_imply_causation

Curse_of_dimensionality

Vanishing_gradient_problem

Machine Learning definition and types: Artificial_intelligence; List_of_machine_learning_concepts; Machine_learning; Data_mining; Knowledge_extraction; Knowledge_extraction#Knowledge_discovery; Pattern_recognition; Signal_processing; Supervised_learning; Semi-supervised_learning; Unsupervised_learning; Reinforcement_learning; Online_machine_learning; Incremental_learning; Q-learning; One-shot_learning / https://www.quora.com/What-is-zero-shot-learning; Feature_learning; Learning_to_rank; Similarity_learning; Biclustering; Natural_language_processing; Biomimetics; Collective_intelligence; Data_stream_mining; Sequential_pattern_mining; Clickstream; Semantics; Semantic_Web; Speech_recognition; Speech_synthesis; Collaborative_filtering

Competitions

Datasets: List_of_datasets_for_machine_learning_research

Software

http://www.databaseetl.com/data-mining-tools/
IDEs / DS-GUI
- R
  - (DS-GUI) :Rattle_GUI http://rattle.togaware.com/
  - (IDE) :RStudio https://www.rstudio.com
- Python
  - (DS-GUI) :Orange_(software) https://orange.biolab.si/
  - (IDE) :Project_Jupyter#Jupyter_Lab https://jupyterlab.readthedocs.io
- Java
- Online
  - http://www.gamifiedonlineweka.ga/ [Dead link]
- Paid Software
  - (DS-GUI) :Minitab https://minitab.com/
  - (DS-GUI) :Tableau_Software https://www.tableau.com/
R/Packages
Python
- https://www.python.org/
- :Scikit-learn http://scikit-learn.org/stable/
C++
- https://orange.biolab.si/
Alteryx
- https://www.alteryx.com/ [Commercial]
Comparison
- http://onlinelibrary.wiley.com/wol1/doi/10.1002/widm.1204/full
DeepLearning
GANs (Generative Adversarial Networks)
DataViz
- https://matplotlib.org/
- https://plot.ly/
- :GGobi http://www.ggobi.org/
- http://ggplot2.org/
- http://ggvis.rstudio.com/
- https://d3js.org/
- https://datascienceplus.com/creating-graphs-with-python-and-goopycharts/
- https://www.tableau.com/ [Commercial]
- http://bokeh.pydata.org/en/latest/ [Python]
- http://pyqtgraph.org/ [Python]
- https://uber.github.io/deck.gl [Uber's internal DataViz tool]
- http://rawgraphs.io/
- http://scidavis.sourceforge.net/
- http://home.gna.org/veusz/
- http://jwork.org/dmelt/
Graphs
GUI
- https://www.rstudio.com/products/shiny/

Data Manipulation: Annotate examples: https://prodi.gy/; Data_pre-processing; Data_cleansing; Data_reduction; Data_wrangling; Data_scrubbing; Data_editing; Data_scraping; Data_curation; Data_fusion; Data_integration; Data_binning; Sanitization_(classified_information); Extract,_transform,_load; Imputation_(statistics); Interpolation; Outlier

Local_case-control_sampling#Imbalanced_datasets

Sampling_(statistics)

Sampling_(statistics)#Stratified_sampling

Stratified_sampling

Jackknife_resampling

Oversampling_and_undersampling_in_data_analysis

Oversampling_and_undersampling_in_data_analysis#SMOTE

"Essay Why Most Published Research Findings Are False"
- http://robotics.cs.tamu.edu/RSS2015NegativeResults/pmed.0020124.pdf
"A Few Useful Things to Know about Machine Learning"
- https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Working with text

Unicode_equivalence#Normalization

URL_normalization

Text_segmentation

Tokenization_(lexical_analysis)

Word2vec https://www.tensorflow.org/tutorials/word2vec

https://google.github.io/seq2seq/
NLP in Python

 https://github.com/explosion/thinc

Working with spatial data

Trend_surface_analysis

Spatial_descriptive_statistics#Ripley.27s_K_and_L_functions

Signal processing

Dynamic_time_warping

Signal processing - Images

Normalization_(image_processing)

Normalized_frequency_(unit)

Image_segmentation

Techniques for Feature/Attribute Selection/Dimensionality Reduction: High-dimensional_statistics; Dimensionality_reduction; Factor_analysis; Principal_component_analysis; Independent_component_analysis; Singular_value_decomposition; Multidimensional_scaling; T-distributed_stochastic_neighbor_embedding; Autoencoder; Deep_learning#Stacked_.28de-noising.29_auto-encoders; Elastic_map; Linear_discriminant_analysis

Signal processing

Compressed_sensing

Working with spatial data

Spatial_analysis

Spatial_analysis#Spatial_dependency_or_auto-correlation

Maths (Stats / Algebra)

Inspiration for this section: https://github.com/soulmachine/machine-learning-cheat-sheet

Pseudo-random_number_sampling

Glossary_of_probability_and_statistics

Bijection,_injection_and_surjection

Mode_(statistics)

Range_(mathematics)

Interquartile_range

Standard_deviation

Collinearity#Usage_in_statistics_and_econometrics

Exponential_smoothing

https://stats.stackexchange.com/questions/100019/window-models-in-stream-data-processing

Autoregressive_model

Autoregressive–moving-average_model

Autoregressive_integrated_moving_average

Autocorrelation

Cross-correlation

Entropy_in_thermodynamics_and_information_theory

Moment_(mathematics)

Likelihood_function

Cumulative_distribution_function

Probability_mass_function

Probability_density_function

Prior_probability

Prior_knowledge_for_pattern_recognition

Permutation https://fr.wikipedia.org/wiki/Arrangement

Combination https://fr.wikipedia.org/wiki/Combinaison_(math%C3%A9matiques)

Dependent_and_independent_variables

Independence_(probability_theory)

Hoeffding's_inequality

Pareto_efficiency

Nash_equilibrium

Pareto_principle

Taxicab_geometry

Norm_(mathematics)#Euclidean_norm

Norm_(mathematics)

Trace_(linear_algebra)

Eigenvalues_and_eigenvectors

Projection_(mathematics)

Hadamard_product_(matrices)

Kernel_(statistics)

Radial_basis_function

Latent_variable

Statistical_inference

Inductive_reasoning

Deduction_and_induction

Transduction_(machine_learning)

Stochastic_process

Probability_theory

Posterior_probability

Bayesian_inference

https://www.analyticsvidhya.com/blog/2017/03/conditional-probability-bayes-theorem/

Bayesian_network

Naive_Bayes_spam_filtering

Naive_Bayes_classifier

Belief_propagation#Approximate_algorithm_for_general_graphs

Regularization_(mathematics)

Normalization_(statistics)

Quantile_normalization

Nyström_method (+PCA)

Preference_(economics)

Delaunay_triangulation

Neighbourhood_(mathematics)

Genetic Algorithms

Mutation_(genetic_algorithm)

Crossover_(genetic_algorithm)

Selection_(genetic_algorithm)

Fitness_function

Utility#Utility_functions

SVM

Kernel_(image_processing)

Kernel_(statistics)

Neural Networks

Rectifier_(neural_networks)

Backpropagation

Gradient_descent

Stochastic_gradient_descent

Gradient_boosting

Softmax_function

- Softmax couples the scores of all classes: because the outputs are forced to sum to 1, training examples from classes other than i also help learning for class i (raising one class's probability necessarily lowers the others).
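
The coupling described in the note above can be seen in a few lines. This is a minimal sketch in plain Python; the function name `softmax` and the example scores are assumptions chosen for illustration, not taken from any of the linked pages.

```python
import math

def softmax(scores):
    """Return softmax probabilities for a list of real-valued scores."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores (assumed): only the score of the last class changes.
before = softmax([1.0, 2.0, 3.0])
after = softmax([1.0, 2.0, 5.0])

# The probabilities of the two untouched classes drop as well,
# because all outputs must still sum to 1.
for p_before, p_after in zip(before, after):
    print(f"{p_before:.4f} -> {p_after:.4f}")
```

Raising a single score changes every output, which is why a gradient step for one class also carries information about the others.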

Sigmoid_function

Hyperbolic_function#Tanh

Dropout_(neural_networks)

Radial_basis_function

Signal processing

Signal_processing

Low-pass_filter

High-pass_filter

Energy_(signal_processing)

Fast_Fourier_transform

Discrete_wavelet_transform

Coherence_(signal_processing)

Time Series

Decomposition_of_time_series

Seasonal_adjustment

Frequency_domain

Spectral_density

A*_search_algorithm

Multi-armed_bandit

Distances: Distance; Euclidean_distance [Minkowski order p=2]; Edit_distance; Hamming_distance; Manhattan_distance [Minkowski order p=1]; Levenshtein_distance; Needleman–Wunsch_algorithm; Minkowski_distance [order p == generalization]; Mahalanobis_distance; Canberra_distance; Distance_correlation; Angular_distance; String_metric; Jaro–Winkler_distance; Jaccard_index; Kendall_tau_distance; Chebyshev_distance; Tf–idf; Neural_coding

For graphs: http://blog.smola.org/post/33412570425
https://fr.wikipedia.org/wiki/Algorithme_de_Needleman-Wunsch
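
The Minkowski distance in the list above generalizes several of the other entries: order 1 gives the Manhattan distance, order 2 the Euclidean distance, and the limit of large order the Chebyshev distance. A minimal sketch, with function names and sample points assumed for illustration:

```python
def minkowski(u, v, p):
    """Minkowski distance of order p (p >= 1) between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def chebyshev(u, v):
    """Largest coordinate difference; the limit of Minkowski as p grows."""
    return max(abs(a - b) for a, b in zip(u, v))

a, b = (0.0, 0.0), (3.0, 4.0)   # illustrative points (assumed)
manhattan = minkowski(a, b, 1)  # |3| + |4|
euclidean = minkowski(a, b, 2)  # sqrt(3^2 + 4^2)
large_p = minkowski(a, b, 50)   # already very close to chebyshev(a, b)
print(manhattan, euclidean, large_p, chebyshev(a, b))
```
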
Clouds

Hausdorff_distance [between clouds of points, a point and a cloud]

Distance#Distances_between_sets_and_between_a_point_and_a_set

Distributions

https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/

Discrete_uniform_distribution

Normal_distribution

Bernoulli_distribution

Binomial_distribution

Poisson_distribution

Chi-squared_distribution

Log-normal_distribution

Pareto_distribution

Gibbs_distribution

Weibull_distribution

Gamma_distribution

Beta_distribution

Hypergeometric_distribution

Dirac_delta_function

https://ercim-news.ercim.eu/en107/special/robust-and-adaptive-methods-for-sequential-decision-making [Characterization of the simplicity of a distribution: BernsteinExponent+TsybakovMarginCondition]

Evaluation: Performance_indicator; Mean_absolute_percentage_error; Mean_absolute_scaled_error; Symmetric_mean_absolute_percentage_error; Regression-kriging

Information_gain_ratio

Kullback–Leibler_divergence

Gini_coefficient

Pearson_correlation_coefficient

http://www.cbcb.umd.edu/~salzberg/docs/murthy_thesis/node15.html

Akaike_information_criterion https://twitter.com/DataSciFact/status/963129411250933760

Bayesian_information_criterion

Brier_score == mean squared error (MSE) of probability forecasts (not RMSE)
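
For a binary outcome, the Brier score is the mean squared difference between the forecast probability and the observed 0/1 outcome. A minimal sketch; the function name and the sample forecasts are assumptions made up for illustration:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

forecasts = [0.9, 0.2, 0.7, 0.4]  # illustrative predicted probabilities
observed = [1, 0, 1, 1]           # illustrative outcomes
score = brier_score(forecasts, observed)
print(score)  # lower is better; 0 means perfect forecasts
```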

Structural_similarity

Type_I_and_type_II_errors

False_positive_rate

False_coverage_rate

False_discovery_rate

Confusion_matrix

Accuracy_and_precision

Precision_and_recall

Sensitivity_and_specificity

Receiver_operating_characteristic

Receiver_operating_characteristic#Area_under_the_curve

Discounted_cumulative_gain

Cross-validation_(statistics)

Errors_and_residuals

If the residual is consistently > 0 or < 0 over a range of the training set, the model has failed to capture something in the data, or the wrong type of model is being used (e.g. linear regression on parabolic data; DataSkeptic/Heteroskedasticity)
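
The residual pattern described above is easy to reproduce. The sketch below fits a straight line to parabolic data by ordinary least squares and prints the sign of each residual; the toy data and the helper name `fit_line` are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [float(x) for x in range(-5, 6)]
ys = [x * x for x in xs]  # parabolic data, deliberately mis-modelled below
slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# One character per point: "+" for a positive residual, "-" otherwise.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)  # ++-------++
```

The residuals are positive at both ends and negative in the middle, in long runs rather than scattered at random, which is the signature of a model family that cannot represent the curvature.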

Heteroscedasticity

Clustering

- See also the Calinski-Harabasz Index: http://stats.stackexchange.com/questions/97429/intuition-behind-the-calinski-harabasz-index

Silhouette_(clustering)

- Others

Item_response_theory

- http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/#elbow-method

Working with Text: Part_of_speech; Semantic_similarity; Tf–idf; Cosine_similarity; Okapi_BM25

See also Mr Gomez page on Weka: http://www.esp.uem.es/jmgomez/tmweka/

Named-entity_recognition

Conditional_random_field

Latent_Dirichlet_allocation

Sentiment_analysis

Document_classification

Automatic_summarization

Working with Images

http://mirror.imagej.net/plugins/mexican-hat/index.html
- If your model seeks to penalize near misses, the Mexican hat function is a good choice.
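
One reading of the note above: the Mexican hat (Ricker) function is positive at the centre, crosses zero, and then dips below zero before fading out, so inputs that are close to the target but not on it receive a negative weight. A minimal sketch using the unnormalized form; the function name and the `sigma` parameter are assumptions for illustration.

```python
import math

def mexican_hat(x, sigma=1.0):
    """Unnormalized Ricker ("Mexican hat") function centred at 0."""
    z = (x / sigma) ** 2
    return (1.0 - z) * math.exp(-z / 2.0)

# Positive on target, zero at |x| = sigma, negative just beyond, ~0 far away.
for x in (0.0, 0.5, 1.0, 1.5, 3.0, 5.0):
    print(x, round(mexican_hat(x), 4))
```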

Working with concepts (Ontologies)

- https://en.wikipedia.org/wiki/YAGO_%28database%29
- http://wiki.dbpedia.org/
- http://conceptnet.io/
- http://cogcomp.org/Data/QA/QC/definition.html

Visualization: Data_visualization; Exploratory_data_analysis; List_of_graphical_methods; Category:Statistical_charts_and_diagrams; Statistical_graphics; Visual_perception; Heat_map; Misleading_graph; Pareto_chart

Need to develop "critical thinking":
- https://www.nytimes.com/column/whats-going-on-in-this-graph
- https://www.nytimes.com/column/learning-whats-going-on-in-this-picture

(Statistical) tests: A/B_testing

Evaluating a hypothesis

Statistical_power

Statistical_hypothesis_testing

Student's_t-test

Chi-squared_test

Type_I_and_type_II_errors

Detecting abrupt changes in time series

Stationary_process

Structural_break

Kruskal–Wallis_one-way_analysis_of_variance

Pairwise_summation

MOSUM: https://cran.r-project.org/web/packages/strucchange/vignettes/strucchange-intro.pdf
Time series / Chaos

Lyapunov_exponent

Kolmogorov_complexity

Machine Learning Techniques: Statistical_classification; One-class_classification; Binary_classification; Multiclass_classification; Multi-label_classification; Structured_prediction; Cluster_analysis; Elbow_method_(clustering); Nearest_neighbor_search#Approximate_nearest_neighbor; Regression_analysis; Linear_regression; Logistic_regression; Ridge_regression; Kriging; Multivariate_adaptive_regression_splines; Association_rule_learning; Apriori_algorithm; Survival_analysis; Monte_Carlo_method; Monte_Carlo_algorithm; Multinomial_logistic_regression; Lasso_(statistics); Expectation–maximization_algorithm; Markov_chain_Monte_Carlo; Hidden_Markov_Models; Viterbi_algorithm; Convolutional_code; Forward–backward_algorithm; Markov_random_field; Mean_field_theory; Mean_field_particle_methods; CART; Decision_tree_learning; Decision_tree; Pruning_(decision_trees); ID3_algorithm; C4.5_algorithm; Random_forest; Support_vector_machine; Support_vector_machine#Support_vector_clustering_.28SVC.29; Support_vector_machine#Regression; Conditional_random_field; Latent_semantic_analysis; Genetic_algorithm; Evolutionary_algorithm; Evolutionary_computation; Voronoi_diagram; Local_outlier_factor; Ordered_weighted_averaging_aggregation_operator

Neural Networks
- History: http://www.chronicle.com/article/The-Believers/190147/
- The various types of NN as a picture: http://www.asimovinstitute.org/wp-content/uploads/2016/09/neuralnetworks.png

Types_of_artificial_neural_networks

Comparison_of_deep_learning_software/Resources

Artificial_neural_network

Feedforward_neural_network

Multilayer_perceptron

Radial_basis_function_network

Long_short-term_memory

Time_delay_neural_network

Recursive_neural_network

Recurrent_neural_network

Hopfield_network

Content-addressable_memory

Boltzmann_machine

Self-organizing_map

Learning_vector_quantization

Liquid_state_machine

Autoassociative_memory

Convolutional_neural_network

Neuroevolution_of_augmenting_topologies

Deep_learning#Deep_neural_network_architectures

Deep_belief_network

Generative_adversarial_networks

- https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks

Neural_Turing_machine

- http://spinningbytes.com/demos/

Instantaneously_trained_neural_networks

Spiking_neural_network

Signal Processing

Optical_character_recognition

Fuzzy Logic

Inference_engine

Type-2_fuzzy_sets_and_systems

T-norm_fuzzy_logics

Adaptive_neuro_fuzzy_inference_system

Fuzzy_control_system

Working with spatial data

Spatial_association

Ensemble Techniques

Weak learner: https://stats.stackexchange.com/questions/82049/what-is-meant-by-weak-learner#82063

Ensemble_learning

Ensembles_of_classifiers

Ensemble_learning#Implementations_in_statistics_packages

Ensemble Learning = Boosting, Bagging or Stacking: http://stats.stackexchange.com/questions/18891/bagging-boosting-and-stacking-in-machine-learning#19053
Bagging averages models trained on bootstrap resamples of the training set, which mainly reduces variance and therefore helps against overfitting.
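
The variance reduction from bagging can be checked with a small simulation. The sketch below compares a single 1-nearest-neighbour regressor (a deliberately high-variance learner) with a bagged version over many freshly drawn training sets. The toy data generator, the query point 0.5, the seed and all function names are assumptions chosen for illustration.

```python
import random
import statistics

def one_nn_predict(train, x):
    """Predict with the label of the single nearest training point."""
    return min(train, key=lambda pt: abs(pt[0] - x))[1]

def bagged_predict(train, x, n_models, rng):
    """Average 1-NN predictions over bootstrap resamples of the training set."""
    preds = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]  # sample with replacement
        preds.append(one_nn_predict(sample, x))
    return sum(preds) / n_models

def make_data(n, rng):
    """Noisy samples of y = x on [0, 1] (toy data)."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, x + rng.gauss(0.0, 0.5)))
    return data

rng = random.Random(0)
single, bagged = [], []
for _ in range(200):  # 200 independent training sets
    train = make_data(30, rng)
    single.append(one_nn_predict(train, 0.5))
    bagged.append(bagged_predict(train, 0.5, 25, rng))

# Spread of the prediction at x = 0.5 across training sets.
var_single = statistics.pvariance(single)
var_bagged = statistics.pvariance(bagged)
print(var_single, var_bagged)
```

The bagged prediction varies less from one training set to the next than the single model's prediction, which is the effect the note refers to.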

Bootstrap_aggregating

Boosting_(machine_learning)

Gradient_boosting

Committee_machine

Applications: Bayesian_spam_filtering; Root_cause_analysis; Inpainting

https://github.com/phillipi/pix2pix
https://www.youtube.com/user/keeroyz
Chatbots
- Personality
  - https://en.wikipedia.org/wiki/Big_Five_personality_traits

Experimentation framework

Goal: test various parameters on various algorithms to determine the best model(s)
Weka's "Experimenter" mode: http://weka.sourceforge.net/manuals/ExplorerGuide.pdf
AutoWeka: http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
R::mlrMBO: https://github.com/mlr-org/mlrMBO
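
The goal stated above (try several parameter settings on several algorithms and keep the best model) can be sketched in plain Python as an exhaustive grid search scored by leave-one-out cross-validation. This is only an illustration of the idea, not how Weka, AutoWeka or mlrMBO work internally; the two toy learners, the parameter grid and the data are all assumptions.

```python
from itertools import product

def knn_regressor(k):
    """Factory: k-nearest-neighbour regression on 1-D inputs."""
    def predict(train, x):
        nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def mean_regressor():
    """Factory: baseline that always predicts the training mean."""
    def predict(train, x):
        return sum(y for _, y in train) / len(train)
    return predict

def loo_mse(predict, data):
    """Leave-one-out cross-validated mean squared error."""
    errors = []
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        errors.append((predict(train, x) - y) ** 2)
    return sum(errors) / len(errors)

# Toy data (assumed): y = x^2 / 10 on x = 0..19.
data = [(float(x), x * x / 10.0) for x in range(20)]

# Each algorithm comes with its own grid of parameter values.
search_space = {
    "knn": (knn_regressor, {"k": [1, 2, 4]}),
    "mean": (mean_regressor, {}),
}

results = []
for name, (factory, grid) in search_space.items():
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        results.append((name, params, loo_mse(factory(**params), data)))

best = min(results, key=lambda r: r[2])
print(best)
```

On this toy data the search selects the k-NN learner with k = 2. Real tools add smarter search strategies (random or model-based search) on top of the same evaluate-and-compare loop.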

Coding / Exposing API to the rest of the application: Microservices

BigData: Data_lake; Streaming_algorithm; Star_schema; OLAP_cube; Solid-state_drive; MongoDB

Map-Reduce framework

Apache_Hadoop https://hadoop.apache.org/

Scraping / log collection

Apache_Flume http://flume.apache.org/

Storage

Apache_Hadoop#HDFS https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

Apache_HBase http://hbase.apache.org/

Apache_Hive https://hive.apache.org/

Transfers - to/from RelationalDB

Sqoop http://sqoop.apache.org/

Transfers - serialization/streaming

Apache_Avro http://avro.apache.org/

Apache_Kafka https://kafka.apache.org/

Storage - In memory

Apache_Spark https://spark.apache.org/

Apache_Flink http://flink.apache.org/

Admin

Apache_ZooKeeper http://zookeeper.apache.org/

Apache_Cassandra https://cassandra.apache.org

Ambari http://ambari.apache.org/

Apache_Oozie http://oozie.apache.org/

Programming

Pig_(programming_tool) https://pig.apache.org/

ML

Apache_Mahout http://mahout.apache.org/

Apache_SystemML http://systemml.apache.org/

Working with text

Elasticsearch https://www.elastic.co/

Working with text - Data Viz

Kibana https://www.elastic.co/products/kibana

Small/Micro Data
- https://arxiv.org/abs/1610.00946

Multi-Agent Systems: Agent-based_model; Multi-agent_system; Agent-oriented_software_engineering

https://www.researchgate.net/publication/266182243_Agent_Groupe_Role_et_Service_Un_modele_organisationnel_pour_les_systemes_multi-agents_ouverts [JFerber: AGR Methodology]

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.7968&rep=rep1&type=pdf [YDemazeau: Vowels Methodology]

Ant_colony_optimization_algorithms

Quantum Machine Learning: Quantum_machine_learning; Quantum_tunnelling; Quantum_annealing; Adiabatic_quantum_computation

Resources

Books

Free Books

  https://github.com/janishar/mit-deep-learning-book-pdf

  https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print10.pdf

- http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf

  http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf

  http://infolab.stanford.edu/~ullman/mmds/booka.pdf

- http://www.guidetodatamining.com/

  http://www.guidetodatamining.com/assets/guideChapters/Guide2DataMining.pdf

- http://www.mlyearning.org/

  https://github.com/ajaymache/machine-learning-yearning

Paid Books
- "Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms", Jeff Heaton, 2013, ISBN: 978-1493682225
- "Artificial Intelligence for Humans, Volume 2: Nature-Inspired Algorithms", Jeff Heaton, 2014, ISBN: 978-1499720570
- "Artificial Intelligence for Humans, Volume 3: Deep Learning and Neural Networks", Jeff Heaton, 2015, ISBN: 978-1505714340
- "Introduction to Machine Learning (Adaptive Computation and Machine Learning)", E. Alpaydin, MIT Press, 2004, ISBN: 978-0262012430
- "Machine Learning: An Artificial Intelligence Approach", R.S. Michalski, J.G. Carbonell, T.M. Mitchell, Symbolic Computation, 1983, ISBN: 978-3540132981
- "A collection of Data Science Interview Questions Solved in Python and Spark vol I & II", Antonio Gulli, CreateSpace, 2015, ISBN: 978-1517216719
- "Artificial Intelligence: A Modern Approach", Stuart Russell and Peter Norvig, Prentice Hall, 1995, ISBN: 978-0131038059
- "An Introduction to MultiAgent Systems", Michael Wooldridge, John Wiley & Sons, 2009 (2nd ed), ISBN: 978-0470519462
- "Data Mining: Practical Machine Learning Tools and Techniques", Ian H. Witten, Eibe Frank, Mark A. Hall, Christopher J. Pal, Morgan Kaufmann, ISBN: 978-0128042915
- "Agent Intelligence Through Data Mining", Andreas L. Symeonidis, Pericles A. Mitkas, Springer/Apress, ISBN: 978-0387257570
- "Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence", Gerhard Weiss, 2000, ISBN: 978-0262232036
- "Data Science at the Command Line", Janssens, O'Reilly.
- Also look for MachineLearning, DeepLearning, Spark, Mahout, R, Python, SciKit-Learn, Data/Text Mining, ElasticSearch, Natural Language, Statistics @ O'Reilly, Packt, Manning/In Action, HeadFirst
Lists of good books

News/Blogs/RSS

Podcasts

YT Channels

MOOCs

Jobs

Teaching

http://edison-project.eu/edison/edison-data-science-framework-edsf

Curated list of similar pages

- https://github.com/search?utf8=%E2%9C%93&q=curated+list+awesome+frameworks&type=
- https://github.com/josephmisiti/awesome-machine-learning
- https://github.com/onurakpolat/awesome-bigdata
- https://github.com/onurakpolat/awesome-analytics
- https://github.com/analyticalmonk/awesome-neuroscience
- https://github.com/igorbarinov/awesome-data-engineering
- https://github.com/quantmind/awesome-data-science-viz
- https://github.com/fasouto/awesome-dataviz
- https://github.com/qinwf/awesome-R
- https://github.com/datascience-python/awesome-datascience-python
- https://github.com/caesar0301/awesome-public-datasets
