
Showing posts with the label Process & Miscellaneous

Give examples of bad and good visualizations

Bad visualization:
- Pie charts: comparisons between items are difficult when area is used, especially when there are many items
- Color choice for classes: heavy use of red, orange and blue. Readers may assume the colors mean good (blue) versus bad (orange and red), whereas they are simply associated with a specific segment
- 3D charts: can distort perception and therefore skew the data
Good visualization:
- Using a solid line in a line chart: dashed and dotted lines can be distracting
- Heat map with a single color: some colors stand out more than others, giving more weight to that data. A single color with varying shades shows the intensity better
- Adding a trend line (regression line) to a scatter plot helps the reader see trends

Do you know a few "rules of thumb" used in statistics or computer science? Or in business analytics?

Pareto rule:
- 80% of the effects come from 20% of the causes
- 80% of the sales come from 20% of the customers
Computer science:
- "Simple and inexpensive beats complicated and expensive" - Rod Elder
Finance, rule of 72:
- Estimates the time needed for an investment to double
- $100 at a rate of 9%: 72/9 = 8 years
Rule of three (economics):
- There are always three major competitors in a free market within one industry
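
The rule of 72 above is an approximation of the exact doubling time under annual compounding. A minimal sketch comparing the two (the function names are illustrative, and annual compounding is an assumption not stated in the post):

```python
import math


def doubling_time_rule_of_72(rate_percent):
    """Rule-of-thumb estimate: years to double at the given annual rate."""
    return 72 / rate_percent


def doubling_time_exact(rate_percent):
    """Exact years to double, assuming interest compounds once per year."""
    return math.log(2) / math.log(1 + rate_percent / 100)


if __name__ == "__main__":
    for rate in (3, 9, 12):
        print(rate, doubling_time_rule_of_72(rate), round(doubling_time_exact(rate), 2))
```

At 9% the shortcut gives 8 years, while the exact figure is slightly above 8, so the rule is close enough for mental arithmetic.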

Explain the difference between "long" and "wide" format data. Why would you use one or the other?

- Long: one column containing the values and another column listing the context of each value
  Fam_id  year  fam_inc
- Wide: each variable in a separate column
  Fam_id  fam_inc96  fam_inc97  fam_inc98
Long vs. wide:
- Some data manipulations (summarize, filter) are much easier when data is in the wide format, for example comparing one family's income across years in a single row; grouped operations per year are typically easier in the long format
- Program requirements: some tools expect one specific format
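
To make the two layouts concrete, here is a minimal sketch that converts between them in plain Python. The column names follow the example above; the income values and the helper names `long_to_wide` and `wide_to_long` are made up for illustration.

```python
from collections import defaultdict

# Long format: one row per (family, year) pair. Values are invented.
long_rows = [
    {"fam_id": 1, "year": 96, "fam_inc": 40000},
    {"fam_id": 1, "year": 97, "fam_inc": 40500},
    {"fam_id": 1, "year": 98, "fam_inc": 41000},
    {"fam_id": 2, "year": 96, "fam_inc": 45000},
    {"fam_id": 2, "year": 97, "fam_inc": 45400},
    {"fam_id": 2, "year": 98, "fam_inc": 45800},
]


def long_to_wide(rows):
    """Pivot long rows into one row per family with a column per year."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["fam_id"]][f"fam_inc{row['year']}"] = row["fam_inc"]
    return [{"fam_id": fam_id, **cols} for fam_id, cols in sorted(wide.items())]


def wide_to_long(rows):
    """Unpivot wide rows back into one row per (family, year) pair."""
    out = []
    for row in rows:
        for col, value in row.items():
            if col.startswith("fam_inc"):
                year = int(col[len("fam_inc"):])
                out.append({"fam_id": row["fam_id"], "year": year, "fam_inc": value})
    return out


wide_rows = long_to_wide(long_rows)
```

In practice a data-frame library would do this with a pivot/unpivot call; the plain-Python version is only meant to show what moves where.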

What is your definition of big data?

Big data is high-volume, high-velocity and/or high-variety information assets that require new forms of processing
- Volume: big data doesn't sample, it just observes and tracks what happens
- Velocity: big data is often available in real time
- Variety: big data comes from texts, images, audio, video...
Difference between big data and business intelligence:
- Business intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.
- Big data uses inductive statistics (statistical inference) and concepts from non-linear system identification to infer laws (regression, classification, clustering) from large data sets with low information density, in order to reveal relationships and dependencies or to predict outcomes or behaviors

Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy?

It depends on the context:
- "Premature optimization is the root of all evil"
- At the beginning: a quick-and-dirty model is better
- Optimization comes later
Other answer:
- Depends on the context
- Is error acceptable? Consider fraud detection or quality assurance

How to detect individual paid accounts shared by multiple users?

- Check the geographical region: a login from Paris on Friday morning and a login from Tokyo on Friday evening
- Bandwidth consumption: if a user goes over some high limit
- Counter of live sessions: 100 sessions per day (about 4 per hour) seems more than one person can do
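
The geographical check above can be sketched as an "impossible travel" test: compute the great-circle distance between two logins and flag the pair if the implied speed is unrealistic. This is only a sketch; the 900 km/h threshold, the login tuple layout and the function names are assumptions for illustration, and the coordinates are approximate city centers.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 900.0  # hypothetical threshold, roughly airliner cruise speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_impossible_travel(login_a, login_b, max_speed_kmh=MAX_SPEED_KMH):
    """Each login is a (timestamp, latitude, longitude) tuple."""
    (t1, lat1, lon1), (t2, lat2, lon2) = login_a, login_b
    distance = haversine_km(lat1, lon1, lat2, lon2)
    hours = abs((t2 - t1).total_seconds()) / 3600
    if hours == 0:
        return distance > 0
    return distance / hours > max_speed_kmh


# Friday morning in Paris, Friday evening in Tokyo (dates are arbitrary).
paris_login = (datetime(2024, 1, 5, 8, 0), 48.8566, 2.3522)
tokyo_login = (datetime(2024, 1, 5, 18, 0), 35.6762, 139.6503)
```

IP-based geolocation is noisy (VPNs, mobile carriers), so a flag like this is better treated as one signal among several than as proof of sharing.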

How would you come up with a solution to identify plagiarism?

- Vector space model approach
- Represent documents (the suspect and the original ones) as vectors of terms
- Terms: n-grams, with n from 1 up to as large as practical (to detect passage plagiarism)
- Measure the similarity between both documents
- Similarity measure: cosine similarity, Jaro-Winkler, Jaccard
- Declare plagiarism above a certain threshold
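
The steps above can be sketched with word n-gram counts and cosine similarity. This is a minimal illustration, not a production detector: the whitespace tokenization, the maximum n of 3, the 0.8 threshold and all function names are assumptions.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Word n-grams of a text, lowercased and split on whitespace."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def term_vector(text, max_n=3):
    """Count all n-grams for n = 1..max_n as a sparse term vector."""
    vector = Counter()
    for n in range(1, max_n + 1):
        vector.update(ngrams(text, n))
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_plagiarized(suspect, original, threshold=0.8):
    """Flag the suspect text if its similarity reaches the threshold."""
    return cosine_similarity(term_vector(suspect), term_vector(original)) >= threshold
```

Longer n-grams reward copied passages rather than shared vocabulary, which is why they are mixed in with single words.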

Explain Tufte's concept of "chart junk"

All visual elements in charts and graphs that are not necessary to comprehend the information represented, or that distract the viewer from this information.
Examples of unnecessary elements include:
- Unnecessary text
- Heavy or dark grid lines
- Ornamented chart axes
- Pictures
- Background
- Unnecessary dimensions
- Elements depicted out of scale to one another
- 3-D simulations in line or bar charts

What is POC (proof of concept)?

- A realization of a certain method to demonstrate its feasibility
- In engineering: a rough prototype of a new idea is often constructed as a proof of concept

How frequently must an algorithm be updated?

You want to update an algorithm when:
- You want the model to evolve as data streams through the infrastructure
- The underlying data source is changing
- Example: a retail store model that remains accurate as the business grows
- Dealing with non-stationarity
Some options:
- Incremental algorithms: the model is updated every time it sees a new training example
  Note: simple, and you always have an up-to-date model, but you can't incorporate data to different degrees. Sometimes mandatory: when data must be discarded once seen (privacy)
- Periodic re-training in "batch" mode: simply buffer the relevant data and update the model every so often
  Note: more decisions and more complex implementations
How frequently?
- Is the sacrifice worth it?
- Data horizon: how quickly do you need the most recent training example to be part of your model?
- Data obsolescence: how long does it take before data is irrelevant to the model?
A...
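
The difference between the incremental and batch options described above can be sketched with the simplest possible "model", a mean. The class and function names are hypothetical; a real model would replace the mean with its own update rule.

```python
class RunningMean:
    """Incremental 'model': updated once per example, which is then discarded."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Standard online mean update; no past examples need to be stored.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


def batch_mean(buffer):
    """Batch 'model': recomputed from scratch over a buffered window of data."""
    return sum(buffer) / len(buffer)


stream = [2.0, 4.0, 6.0]
model = RunningMean()
for value in stream:
    model.update(value)
```

The incremental version never stores the stream, which is what makes it usable when data must be discarded once seen; the batch version needs the buffer but lets you choose which data to include at each re-training.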

How to efficiently scrape web data, or collect tons of tweets?

- Python example
- Requesting and fetching the webpage: httplib2 module
- Parsing the content and extracting the necessary info: BeautifulSoup from the bs4 package
- Twitter API: a Python wrapper for performing API requests. It handles all the OAuth and API queries in a single Python interface
- MongoDB as the database
- PyMongo: the Python wrapper for interacting with the MongoDB database
- Cron jobs: a time-based scheduler that runs scripts at specific intervals; spacing out requests helps avoid the "rate limit exceeded" error
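
As a rough illustration of the parsing step, the sketch below extracts link targets from an HTML string. Note the swap: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and it works on a hard-coded sample page rather than a fetched one. `LinkExtractor`, `extract_links` and `SAMPLE_HTML` are illustrative names.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML document string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Stand-in for a page that would normally be fetched over HTTP.
SAMPLE_HTML = (
    '<html><body><a href="/page1">One</a><p>text</p>'
    '<a href="https://example.com/page2">Two</a><a>no href</a></body></html>'
)
```

In a real scraper the string would come from the HTTP client, and the extracted records would be written to the database on each scheduled run.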

What is the life cycle of a data science project?

1. Data acquisition
Acquiring data from both internal and external sources, including social media or web scraping. In a steady state, data extraction and routines should be in place, and new sources, once identified, would be acquired following the established processes.
2. Data preparation
Also called data wrangling: cleaning the data and shaping it into a suitable form for later analyses. Involves exploratory data analysis and feature extraction.
3. Hypothesis & modelling
As in data mining, but with all the data instead of samples: machine learning techniques are applied to all the data. A key sub-step is model selection. This involves preparing a training set for model candidates, and validation and test sets for comparing model performances, selecting the best-performing model, gauging model accuracy and preventing overfitting.
4. Evaluation & interpretation
Steps 2 to 4 are repeated a number of times as ne...

What is star schema? Lookup tables?

The star schema is a traditional database schema with a central (fact) table: the "observations", with database "keys" for joining with satellite tables, and with several fields encoded as IDs. Satellite tables map IDs to physical names or descriptions and can be "joined" to the central fact table using the ID fields; these tables are known as lookup tables and are particularly useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve multiple layers of summarization (summary tables, from granular to less granular) to retrieve information faster.
Lookup tables:
- An array that replaces runtime computations with a simpler array indexing operation
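
A minimal sketch of a star schema, using an in-memory SQLite database from the Python standard library. The table names, columns and sales figures are invented for illustration: one fact table stores IDs and measures, and two satellite (lookup) tables map those IDs to names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Satellite (lookup) tables: map IDs to descriptions.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);

-- Central fact table: one row per observation, fields encoded as IDs.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id INTEGER REFERENCES dim_store(store_id),
    amount REAL
);

INSERT INTO dim_product VALUES (1, 'coffee'), (2, 'tea');
INSERT INTO dim_store VALUES (10, 'Paris'), (20, 'Tokyo');
INSERT INTO fact_sales VALUES (1, 1, 10, 3.5), (2, 2, 10, 2.0), (3, 1, 20, 4.0);
""")

# Join the fact table to its lookup tables to get readable totals.
rows = conn.execute("""
    SELECT p.name, s.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
```

Storing the short IDs in the fact table instead of repeating the names on every row is where the memory saving comes from.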

Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?

Hash tables:
- Average-case O(1) lookup time
- Lookup time doesn't depend on size
Even in terms of memory:
- O(n) memory
- Space scales linearly with the number of elements
- Many small dictionaries won't take up significantly less space than one larger one
In-database analytics:
- Integration of data analytics into data warehousing functionality
- Much faster, and corporate information is more secure because it doesn't leave the enterprise data warehouse
- Good for real-time analytics: fraud detection, credit scoring, transaction processing, pricing and margin analysis, behavioral ad targeting and recommendation engines
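
To see why splitting into many small tables buys little, here is a sketch of both layouts using Python dictionaries. The sharding scheme (hash of the key modulo the shard count) and the function names are assumptions for illustration. Both lookups are O(1) on average; the sharded one simply performs an extra hash-and-index step to pick the table first.

```python
NUM_SHARDS = 100


def build_single(items):
    """One big hash table holding every key."""
    return dict(items)


def build_sharded(items, num_shards=NUM_SHARDS):
    """Many small hash tables; each key is routed by its hash."""
    shards = [{} for _ in range(num_shards)]
    for key, value in items:
        shards[hash(key) % num_shards][key] = value
    return shards


def lookup_sharded(shards, key):
    """Extra step versus a single table: choose the shard, then look up."""
    return shards[hash(key) % len(shards)].get(key)


items = [(f"user{i}", i) for i in range(10_000)]
single = build_single(items)
shards = build_sharded(items)
```

Sharding can still be useful for other reasons, such as reducing lock contention between threads, but not for raw lookup speed.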

Compare R and Python

R:
- Focuses on user-friendly data analysis, statistics and graphical models
- The closer you are to statistics, data science and research, the more you might prefer R
- Statistical models can be written in only a few lines
- The same piece of functionality can be written in several ways
- Mainly used for standalone computing or analysis on individual servers
- Large number of packages, for anything!
Python:
- Used by programmers who want to delve into data science
- The closer you are to an engineering environment, the more you might prefer Python
- Coding and debugging are easier, mainly because of the clean syntax
- A given piece of functionality is usually written the same way
- A good fit when data analysis needs to be integrated with web apps
- A good tool for implementing algorithms for production use

Provide examples of machine-to-machine communications

Telemedicine:
- Heart patients wear a specialized monitor which gathers information about the state of the heart
- The collected data is sent to an implanted electronic device, which sends electric shocks back to the patient to correct abnormal rhythms
Product restocking:
- Vending machines can message the distributor whenever an item is running out of stock

Examples of NoSQL architecture

- Key-value: all of the data consists of an indexed key and a value. Examples: Cassandra, DynamoDB
- Column-based: designed for storing data tables as sections of columns of data rather than as rows of data. Examples: HBase, SAP HANA
- Document database: maps a key to a document that contains structured information. The key is used to retrieve the document. Examples: MongoDB, CouchDB
- Graph database: designed for data whose relations are well represented as a graph, with interconnected elements and an undetermined number of relations between them. Examples: Polyglot Neo4J

How to optimize algorithms? (parallel processing and/or faster algorithms). Provide examples for both

"Premature optimization is the root of all evil" - Donald Knuth
Parallel processing: for instance in R on a single machine
- doParallel and foreach packages
- doParallel: parallel backend, will select n cores of the machine
- foreach: assigns tasks to each core
- Using Hadoop on a single node
- Using Hadoop on multiple nodes
Faster algorithms:
- In computer science, the Pareto principle: 90% of the execution time is spent executing 10% of the code
- Data structures: affect performance
- Caching: avoid unnecessary work
- Improve the code at the source level
  For instance: on early C compilers, WHILE(something) was slower than FOR(;;), because WHILE evaluated "something" and then had a conditional jump which tested if it was true, while FOR had an unconditional jump.
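
The caching point above can be illustrated with a small Python sketch. The Fibonacci function and the call counters are illustrative choices only: `functools.lru_cache` stores results so that each distinct input is computed once, while the plain version recomputes the same subproblems over and over.

```python
from functools import lru_cache

# Count how many times each function body actually runs.
calls = {"plain": 0, "cached": 0}


def fib_plain(n):
    """Naive recursion: recomputes the same values many times."""
    calls["plain"] += 1
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)


@lru_cache(maxsize=None)
def fib_cached(n):
    """Same recursion, but each distinct n is evaluated only once."""
    calls["cached"] += 1
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)
```

Calling both with n = 20 gives the same result, but the cached version runs its body 21 times (once per value from 0 to 20), whereas the plain version runs it thousands of times.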