Managerial Economics/Data Science, research, and insights

Data Science
Data science is an interdisciplinary field that utilises scientific methods to collect, store and analyse data, and to extract insights from it. It combines tools from various fields including statistics, computer science, machine learning and the social sciences. It has become an integral part of information technology, and the field's importance has been growing as firms seek to use it to increase their productivity (Maneth & Poulovassilis, 2017).

Current and Future Applications of Data Science

 * Internet Search: Since the early 2000s, data science has driven the development of algorithms that deliver accurate results to search queries in fractions of a second, paving the way for the ubiquitous phrase ‘google it’.
 * Medical Image Analysis: Machine learning is increasingly able to help detect tumours, delineate organs and identify other medical issues, reducing the need for human decision-making.
 * Targeted Advertisements/Recommendations: Targeted advertisements and recommendations (such as for products, or videos to watch on YouTube/Netflix) are a major application of data science, contributing to higher click-through rates for advertisers and enticing users into consuming more content.
 * Speech Recognition: Although still at a relatively early stage of development, data science and machine learning are integral to the progression of this technology. Speech recognition has many applications, including personal assistants and the possible replacement of many human roles such as receptionists and call-centre staff.
 * Advanced Image Recognition: This requires huge processing power and is still in its infancy. Applications are currently limited to simpler tasks such as tagging relevant people in Facebook posts or reverse image searching. Future applications may be far more powerful, such as allowing computers to inspect and approve items along production lines, or to recognise crimes and alert the police.

Big Data
Advances in digital technology have led to the creation and accumulation of vast amounts of unstructured data. This collective information, "Big Data", has formed a field of study that deals with data sets too large or complex to be handled by traditional data processing software. Because it is produced continuously as a by-product of digital activity, big data has become cheap to generate. Data scientists, however, are expensive to employ, as the skills necessary to interpret big data are sought after in many industries. Big data analytics relies on the extraction of users' personal data to produce individual user "profiles". These profiles bring together the personal sphere of an individual with the sphere of a commercial relationship, as they can be used to great effect in targeted advertising.

What Makes Data Big?
Three main attributes can be applied to big data.

Volume

The volume of data in existence is growing rapidly; by some estimates, each person now generates upwards of 100 GB of data every day. Owing to advances in worldwide connectivity and the spread of social media, the amount of data produced on a regular basis keeps increasing. Across different platforms, these high volumes create storage challenges for firms, and they push platforms to advance the technology and capability needed to process the data.

Velocity

Data is collected in real time. The increase in the volume of data has put stress on database systems: to collect in real time, a database must quickly process data from different sources and then forward it to the business processes that require it for day-to-day operations. A good example of real-time data is Amazon providing recommendations based on the searches and location of users. Services search through transaction data and surface relevant results, which can tempt users into a different or additional purchase. To handle big data in real time, firms need very advanced processing technologies.
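The recommendation pattern described above can be sketched in a few lines of Python. The product names and co-occurrence counts below are invented for illustration; a real system would maintain these counts from streaming transaction data.

```python
# Toy recommendation sketch: suggest items that most often co-occur
# with the user's latest search, using a made-up co-occurrence table.
co_occurrence = {
    "tent": {"sleeping bag": 8, "torch": 5, "novel": 1},
    "novel": {"bookmark": 7, "torch": 1},
}

def recommend(search_term, k=2):
    related = co_occurrence.get(search_term, {})
    # Highest co-occurrence counts first.
    return sorted(related, key=related.get, reverse=True)[:k]

print(recommend("tent"))
```

In a real pipeline the co-occurrence table would be updated continuously as purchases stream in, which is exactly where the velocity challenge arises.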

Variety

The variety of data refers to its diverse sources. These may include messages, tweets, images and video posted to social networks, readings from sensors, GPS signals from cell phones and security camera recordings, all of which have become more prominent in recent years. Other sources include files produced by applications such as Microsoft Excel, text and audio files, and server log files. Such data can be either structured or unstructured. Structured data is presented in a clear, pre-defined format. In contrast, unstructured data refers to audio, text and video data, as well as all general data without a pre-defined data model. Owing to the increase in video, audio and image sharing, most of the recent growth has been in unstructured data.

There are two less-known dimensions of big data:

Variability

In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending on social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage, even more so with unstructured data. It is therefore fundamental for researchers to be aware of the degree of variability within big data to avoid higher margins of error.

Complexity

Today's data comes from multiple sources, which makes it difficult to link, match, cleanse and transform data across systems. It is nevertheless necessary to connect and correlate relationships, hierarchies and multiple data linkages, or the data can quickly spiral out of control.

Big Data Analytics
Big data analytics is the process of collecting, storing and processing increasingly large and complex data sets from a variety of sources and turning them into competitive advantage. By systematically analysing data, a data scientist can extract hidden trends that benefit the business by allowing it to make informed decisions.

Example: Starbucks

Imagine that Starbucks is concerned with the taste of a new coffee product. They decide to use the platform of social media to conduct market research.

Based on the feedback received, they determine that the taste was fine but the price was too high. Starbucks is therefore able to make adjustments to increase customer satisfaction. Potential solutions range from reducing the price of the new flavoured drinks during a promotional period to offering discounted coffee during set happy hours.

Components of Big Data Analytics


 * Data management difficulty: Large data silos continuously form and often go untouched; analysing them requires a wide array of techniques, applied over and over. Implementing big data analytics is much easier said than done, and each stage of analysis requires different skills. As this is an emerging industry, IT and data analytics specialists are especially sought after.
 * Quantum computing: The current state of technology only allows a certain capacity for analysing and interpreting the massive amounts of data collected. Quantum computing, the future of computing, promises to significantly improve analytics by allowing billions of data points to be processed within a few minutes. However, quantum computing is still undergoing research and development; large tech companies, like IBM and Google, will soon be testing quantum computers and integrating them into their business processes.
 * Smarter and tighter cyber-security: While technology advances human capabilities positively, it also brings a negative aspect: hacking and breaches are now more common and dangerous. Companies are increasingly incorporating big data into their cybersecurity measures to prevent and mitigate the impact of future cyber-attacks.
 * Artificial intelligence: The vast amount of data available enables machine learning and artificial intelligence systems to learn more. This results in smarter bots that are increasingly integrated into many businesses and their online platforms.

Data Science Manager
The role of a Data Science Manager is to collect and analyse information, such as Big Data, to provide data-driven insights and solutions to business problems.

Data science managers are people who have knowledge of the software and hardware being used and have good communication skills along with domain knowledge. They are usually responsible for the following five major processes:


 * 1) Recruiting data engineers and data scientists: A big part of data science is engineering. Systems must be devised to collect and organise the data so that it can be extracted efficiently. The role of an engineer is to develop and execute these systems so that useful knowledge and insights from structured and unstructured data can be utilised throughout the firm effectively.
 * 2) Identifying the problems that need to be solved: The data science manager is required to think like a researcher, narrowing down to one specific question in order to form a hypothesis.
 * 3) Matching the right people to the right problems (effective task delegation): The data manager checks the compatibility within the team to ensure the skills of each employee match the correct position.
 * 4) Setting goals and priorities: It is necessary to establish a clear path to reach goals, as this helps the manager avoid becoming overwhelmed when there are multiple goals to pursue.
 * 5) Managing the data science process: This is an administrative process to ensure the accessibility, reliability and timeliness of the data.

Because data science is an interdisciplinary discipline, the ability of data science managers to integrate and distribute resources is critical. It is thus important for them to espouse traditional managerial qualities in conjunction with those previously stated.

Machine Learning
Machine learning algorithms are designed to build a mathematical model from pre-existing training data in order to make predictions on new data. Machine learning is an important tool used by data analysts to help predict future outcomes such as the weather, economic growth and human behaviour. As machine learning capabilities expand, these algorithms are being used more and more commonly in everyday products like smartphones, cars, washing machines and even coffee machines.

Machine learning is often about recognising patterns in a data set, and it helps to automatically detect patterns in data and then use the patterns found as a tool for predicting future actions.

One disadvantage of machine learning is that it cannot adequately prove causation, only correlation. This may not be an issue, because data scientists are often only interested in prediction; conclusions can be useful for strategic implementation without knowing the reasons behind them. In such situations, however, there are risks involved. For example, a strategy based on correlation alone may fail when the underlying reasons for individual decision-making differ.

Machine learning is an emerging interdisciplinary field drawing on important theories from classical disciplines such as probability theory and statistics, and it makes an important contribution to solving data mining problems. By training classifiers or algorithms on large amounts of known data, accumulating further data through the system itself, and improving through that continuously acquired data, a system can learn decision functions for classification or regression problems, quickly and accurately predict similar unknown samples, and gradually replace manual effort in repetitive and complex work. However, since specific data mining problems differ greatly and have different requirements for solution time and storage cost, the machine learning algorithms used also differ. Commonly used machine learning algorithms include decision trees, the K-means algorithm, support vector machines (SVMs), the expectation-maximisation (EM) algorithm, and Bayesian classifiers.
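As an illustration of one of the algorithms listed above, here is a minimal one-dimensional K-means sketch in pure Python. The spending figures, the starting centroids and the choice of two clusters are all invented for illustration; real applications would use a library implementation on multi-dimensional data.

```python
# Minimal 1-D K-means sketch: repeatedly assign each point to its
# nearest centroid, then move each centroid to the mean of its points.
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical daily spend data with two natural groups.
spend = [1.0, 1.2, 0.8, 9.5, 10.1, 9.9]
centroids, clusters = kmeans_1d(spend, centroids=[0.0, 5.0])
print(centroids)
```

The two centroids settle near the means of the low-spend and high-spend groups, which is the clustering behaviour K-means is designed to produce.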

Machine learning paradigms

 * 1) Unsupervised Learning: Uses training data that contains the inputs, but not the outputs, to build an algorithm that uncovers patterns in the data, e.g. automated targeting of ads to similar product/service users (Google Ads).
 * 2) Supervised Learning: Uses training data that contains both the inputs and outputs to build an algorithm that predicts the outcome when it is not observed. For example, a categorisation method in email (whether a message goes to junk or the inbox).
 * 3) Reinforcement Learning: Learns how to take actions that maximise cumulative reward, trading off exploration against exploitation. The concept is similar to the trade-off between innovation and economies of scale when a company reallocates limited capital. For example, while someone is learning to ride a bike, the reward is staying upright, while the cost of exploration is falling off the bike and possibly getting injured.
 * 4) Self Learning: Learn to compute, in a crossbar fashion, both decisions about actions and emotions - driven by the interaction of cognition and emotion. This concept builds an algorithm to learn goal-seeking behaviour in an environment with desirable and undesirable situations.
 * 5) Feature Learning: Features are learned using either labelled or unlabelled input data. Examples include artificial neural networks, dictionary learning and matrix factorisation. It is motivated by the fact that machine learning tasks (such as classification) require input data in a form that is mathematically and computationally convenient to process.
 * 6) Anomaly Detection: Anomaly detection is the identification of rare items that differ significantly from the majority of the data. This is useful for observing events or items that raise suspicions such as bank fraud, medical problems or errors in text.
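The anomaly-detection paradigm above can be sketched with a simple statistical rule: flag values that lie far from the mean. The transaction amounts and the two-standard-deviation threshold below are illustrative choices, not a production rule.

```python
# Simple anomaly-detection sketch: flag values more than a given
# number of standard deviations from the mean of the data.
from statistics import mean, stdev

def anomalies(values, threshold=2.0):
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Hypothetical transaction amounts; 5000 is the planted outlier,
# the kind of value a bank-fraud check might want to surface.
amounts = [20, 22, 19, 21, 23, 20, 5000]
print(anomalies(amounts))
```

Real fraud or fault detection uses far more sophisticated models, but the idea is the same: identify the rare items that differ significantly from the majority of the data.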

An Example of Machine learning
Machine learning, and more specifically deep learning, is of great importance to firms in the technology and Artificial Intelligence (AI) sector. As an example, take a company like Grammarly. This firm relies on Artificial Intelligence systems, processes and outputs, in the form of an algorithm, to satisfy its consumers in the realm of grammatical editing. Through this algorithm, the firm draws on other people's written words from a range of genres (academic, blog, professional, entertainment) and then makes suggestions to writers based on common and identifiable patterns, all made possible by machine learning. Furthermore, machine learning is relevant to a concept covered previously: critical mass. By definition, critical mass is the number of people, participants or items needed to start, maintain and grow a venture. For Grammarly and its algorithmic technology, critical mass is critical: firms like these rely on it to improve their algorithm, learning from the expert writers who feed the artificial intelligence, which in turn makes better suggestions for other writers. Overall, machine learning and artificial intelligence are at the core of many technology companies and offer several benefits that would not otherwise be possible.

Computation
When it comes to developing machine learning and designing algorithms, a knowledge of programming languages is essential. Data scientists use these programming languages to write algorithms and create statistical models that are used by computer systems to perform tasks without receiving explicit instructions. There are many different programming languages each with their own strengths and weaknesses in performing different tasks.

Popular Programming Languages For Data Scientists

 * R: A popular and free open-source program used for statistical computing with a rich ecosystem and an open-source library of packages. Commonly used by statisticians and data miners.
 * Python: A programming language well suited to machine learning algorithms at large scale. Python code is both easier to maintain and more robust than R. Commonly used by technology companies.
 * JavaScript: A well-known and widely used programming language. JavaScript is the preferred language for website developers. It is less commonly used for data science, as it has minimal inbuilt functionality compared with other languages.
 * Java: Considered the most reliable programming language, Java is used primarily for developing databases, as well as many Android apps. Its verbosity makes it an unlikely first choice for a data scientist.
 * SQL: A domain-specific language commonly used by data scientists as a data extraction tool. Also useful for managing held data.
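To illustrate SQL's role as a data-extraction tool, the sketch below uses Python's built-in sqlite3 module with a made-up sales table; the table name and figures are invented for illustration.

```python
# Sketch of SQL as a data-extraction tool, using an in-memory
# SQLite database with a made-up sales table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("coffee", 4.5), ("coffee", 5.0), ("tea", 3.0)])

# Aggregate revenue per product, ordered from highest to lowest.
rows = conn.execute(
    "SELECT product, SUM(amount) FROM sales "
    "GROUP BY product ORDER BY SUM(amount) DESC").fetchall()
print(rows)
```

The `GROUP BY` aggregation is typical of the extraction work data scientists use SQL for before any modelling begins.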

Data Visualisation
Data Visualisation is the graphical representation of data and information. It involves producing visual elements like charts, graphs and maps that communicate relationships between represented data to the viewers, providing an accessible and fast method to see and understand trends, outliers and patterns in data. Some online tools that are commonly used for data visualisation are: Tableau, PowerBI, Qlikview, Chart Studio, FusionCharts, Highcharts, Datawrapper, Sisense, Chart.js and D3.js.

Seven ways that data visualisation affects decision-making in organisations are:


 * Faster action: Humans tend to process visual information far more easily than written information. Using charts or graphs to summarise complex data ensures faster comprehension of relationships than cluttered reports or spreadsheets, making decision-making easier.
 * Communicating findings in constructive ways: Business reports are usually formalised documents, often inflated with static tables and a variety of chart types. Decision-makers can more easily interpret wide and varying data sources through visualisations: meaningful graphics help engage and inform busy executives and business partners on problems and pending initiatives.
 * Understanding connections between operations and results: Data visualisation allows users to highlight connections between operating results and overall business performance. Identifying correlations between business functions and market performance is essential in a competitive environment. For example, bar charts may alert the sales team that sales of their products have dropped from the previous month.
 * Embracing emerging trends: The data collected can expose new opportunities for adaptable companies by monitoring sales. For example, a jeans company may see that sales of ripped jeans are rising, allowing it to promote ripped jeans well ahead of its competitors.
 * Interacting with data: Data visualisation exposes changes in a timely manner. For instance, big data visualisation tools can show a car manufacturer that sales of its sedan cars are down.
 * Creating new discussion: For example, visualisation can show the development of product performance over time in multiple geographic areas, making it easier to see which areas are performing very well or under-performing.
 * Machine learning: Big companies such as Amazon, Google, Pinterest and Yelp use machine learning to eliminate spam email, show relevant content and sort through user-uploaded photos. Social media platforms such as Facebook also use machine learning to suggest friends who share similar hobbies and interests.

Principles of Data Visualisation
Professor Edward Tufte explained that users of information displays are aiming to execute particular analytical tasks such as making comparisons, and so, the design principle of the information graphic should support the analytical task. Different graphical elements accomplish this more or less effectively, for example, pie charts are outperformed by dot plots and bar charts. Edward Tufte, in his 1983 book The Visual Display of Quantitative Information, defined the principles for effective graphical display as: "Excellence in statistical graphics consists of complex ideas communicated with clarity, precision and efficiency. Graphical displays should:


 * Show the data. Showing clusters is better than showing lines.
 * Induce the viewer to think more about the substance of data, rather than a methodology or graphic design.
 * Avoid distorting what the data has to say and avoid bias, e.g. manipulating axes to make changes look more or less drastic than they actually are.
 * Present many numbers in a small space.
 * Make large data sets coherent.
 * Encourage visual comparison of data when possible. Distance and colour matter when comparing two graphs; differences should be easily discernible.
 * Reveal the data at several different levels of detail, from a broad overview down to a fine structure.
 * Serve a clear purpose: description, exploration, tabulation or decoration.
 * Be closely integrated with the statistical and verbal descriptions of a data set.

Reproducibility of Data Science Reports
An independent researcher must be able to replicate an experiment that has already been completed and achieve the same results. This allows potential errors in the collection or analysis of the data to be spotted and improved upon by a third party, ensuring the result is empirical. There are three main objectives of reproducibility:


 * 1) To allow verifiability of claim
 * 2) To increase the robustness of the findings
 * 3) To help others build upon the reported research

Some practices that are important to allow reproducibility of data science reports are:


 * The research report should be accompanied by the original data.
 * The data-generating process (e.g. the experiment) should be explained in enough detail that it can be replicated.
 * Data analysis should be fully automated and the code to produce the results should be made publicly available.
 * The analysis code should be written in a clear and concise way.
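The practices above can be sketched as follows: a fully automated analysis with a fixed random seed produces identical results on every run, so a third party running the same code recovers the same reported statistic. The simulated data and the statistic are illustrative.

```python
# Sketch of a fully automated, reproducible analysis: fixing the
# random seed makes the simulated "experiment" repeatable end to end.
import random

def run_analysis(seed=42, n=1000):
    rng = random.Random(seed)          # seeded generator, not global state
    sample = [rng.gauss(mu=10, sigma=2) for _ in range(n)]
    return sum(sample) / len(sample)   # the reported statistic

# Two independent runs reproduce the same result exactly.
print(run_analysis() == run_analysis())
```

Using a dedicated seeded generator, rather than the global random state, keeps the analysis reproducible even when other code in the same process also draws random numbers.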

Artificial Intelligence
Artificial Intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. AI belongs to the branch of computer science, but as a new, comprehensive high-technology discipline it covers a wide range of knowledge fields, including computer science, biology, linguistics, philosophy, system science and many other disciplines. AI processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions) and self-correction. Artificial Intelligence can be applied to any intellectual task. Some examples of AI are autonomous vehicles (in the form of drones and self-driving cars), mathematical theorem solving, art creation, medical diagnosis and search engines.

AI can be categorised as either weak or strong. Weak AI, also known as narrow AI, is an AI system that is designed and trained for a particular task. Weak AI focuses on "tool theory", the extension of intelligent machines as human tools. It refers to machines that solve and compute problems intelligently through algorithms written by humans, using computational tools such as neural networks, expert systems and fuzzy logic, where human triggering or manipulation is required to make behavioural decisions. Virtual personal assistants, such as Apple's Siri, are a form of weak AI. Strong AI, also known as artificial general intelligence, is an AI system with generalised human cognitive abilities. Strong AI refers to the study and imitation of biological behaviour or the brain, with the aim of achieving the "trinity" of consciousness, emotion and reasoning. Strong AI would have human-like perception, thought and emotion, self-awareness, and the ability to reason logically and solve problems. It could perform cognitive work, develop solutions, solve difficult problems, understand abstract thinking, learn from experience and make autonomous decisions. When presented with an unfamiliar task, a strong AI system would be able to find a solution without human intervention.

There are 3 key steps to developing artificial intelligence:


 * Defining the Domain Structure: Break a complex problem down into singular tasks that can be solved with machine learning. So far, success has been found in well-defined structured games such as Chess or Pacman. Domain Expertise is valuable when breaking down composite tasks in business applications.


 * Generating the Necessary Data: Conduct experiments to generate outcome data and build management systems to store and analyse this data. It is possible to simulate data for machine learning in games.


 * Building Machine Learning Algorithms: Use aggregate data to build algorithms for each composite task and combine them to allow artificial intelligence to make decisions.

Types of Artificial Intelligence
Arend Hintze, an assistant professor of integrative biology and computer science and engineering at Michigan State University, has categorised AI into four different types, ranging from the simpler AI systems that exist today to sentient systems, which do not yet exist. The four types are as follows:


 * Type 1: Reactive Machines. An example of this is Deep Blue, the IBM chess program that beat Garry Kasparov in the 1990s. Deep Blue can identify pieces on the chess board and make predictions, but it has no memory and cannot use past experiences to inform future ones. It analyses its own and its opponent's possible moves, and then chooses the most strategic move. Deep Blue and Google's AlphaGo were designed for narrow purposes and so cannot easily be re-applied to other situations. This is a form of Weak AI.
 * Type 2: Limited Memory. These AI systems can use past experiences to make informed future decisions. Some of the decision-making functions in self-driving cars are designed this way. Observations inform actions happening in the not-so-distant future, such as a car changing lanes. These observations are not stored permanently. This is a form of Weak AI.
 * Type 3: Theory of Mind. This psychology term refers to the understanding that others have their own beliefs, desires and intentions that impact the decisions they make. This is a form of Strong AI. This kind of AI does not yet exist.
 * Type 4: Self-Awareness. In this category, AI systems have a sense of self, they have consciousness. Machines with self-awareness understand their current state and can use the information to infer what others are feeling. This is a form of Strong AI. This type of AI does not yet exist.

How can firms make use of AI?
Artificial Intelligence is already widely used in business applications, including automation, data analysis and natural language processing. The adoption of AI by firms to process data and predict decisions is becoming more prevalent in various industries because it streamlines operations and improves efficiency. Some audit firms already use artificial intelligence to process data because AI is faster and more accurate, and evidence shows that AI has outperformed the traditional logistic regression model in audit firms (Kirkos, Spathis & Manolopoulos, 2010). Japanese prime minister Shinzō Abe established an Artificial Intelligence Technology Strategy Council in 2016, which focused on using AI to improve productivity in the fields of health and mobility (The Future of Life Institute, 2019). Artificial Intelligence is now used in Japan by some firms that have replaced waiters with robots: OriHime-D robots, developed by Aki Yuki and Yoshifumi Shiiba in 2018, serve as robot waiters in a Tokyo cafe (Fleming, 2018).

Other common uses for AI in business include:


 * Forecasting consumer behaviour and individualised product recommendations
 * Detection of Fraud
 * Personalised marketing messages and advertising
 * Automated customer service via telephone or chatbots
 * Cross-referencing and transferring data

Availability of Data
Data is the foundation of AI; therefore, it is imperative that it is easily accessed and of good quality. To create value from artificial intelligence, a firm must ensure that all data variables are captured. In the early days of adopting AI, companies typically find that their data is held across a variety of source systems, generating ‘dirty data’ that is essentially useless without human intervention. This data must be converted and imported into a common system before any processes can be built.

Skills Shortage
The availability of trained staff capable of managing AI technology presents another limitation for firms. A study conducted in 2019 shows that 93% of participating firms considered artificial intelligence to be a business priority moving into the future. However, 51% of the firms admitted that they lacked the skilled personnel to implement these technologies into their processes. This forces firms to spend resources on upskilling employees in order to deploy AI solutions.

Cost
Investing money into emerging technologies will naturally force firms to strongly consider their return on investment. At this stage in the development of AI's capabilities, it is difficult for decision-makers to build a business case justifying an investment in artificial intelligence. As is the case in most firms, the in-house skills are unsuitable for a rollout of AI, so firms must decide between two options: (1) invest in upskilling current employees or recruiting skilled talent; or (2) outsource to an AI firm. Both solutions are costly and will likely show little short-term gain, so it is crucial that decision-makers consider the long-term benefits. Finally, the complex nature of artificial intelligence means that maintenance and repair of systems is incredibly expensive.

Risks And Biases
“It would be nice if all of the data which sociologists require could be enumerated because then we could run them through IBM machines and draw charts as the economists do. However, not everything that can be counted counts, and not everything that counts can be counted”.

Risks
Data should be the foundation of the decision-making process, not a substitute for good judgment. Data must first be measured before it can be managed, and beginning to analyse data that is incomplete may cause inaccurate outcomes. The following issues may be encountered when dealing with data:
 * Many important things cannot be quantified. The human brain is unpredictable, and as such, emotions, behaviour and decisions cannot be quantitatively anticipated.
 * Manipulation of data to fit a prerogative is known as 'Juking the Stats' and will lead to an inaccurate outcome.
 * Data generating process can be biased due to errors or human biases. For example, Google translation makes errors when translating genders from Turkish to English.
 * The past is not always a reliable indicator of the future. Collected data may not always be replicable or applicable in real-life scenarios.

Data Bias
The available data of a study may sometimes not be representative of the population or phenomenon studied. This is what is commonly defined as data bias. It also arises when the data does not include variables that properly capture the phenomenon to be predicted, or when it includes human-produced content that may be biased against groups of people. Seven major biases are described as follows.

Sample Bias

This happens when the collected data does not accurately represent the environment the program is expected to run in. For example, if the goal is to create a model that can operate security cameras during both daytime and nighttime, but it is trained on nighttime data only, sample bias has been introduced into the model.

Sample bias can be reduced or eliminated by:

Covering all the cases the product is expected to be exposed to. This can be done by examining the domain of each feature and making sure the data is balanced and evenly distributed across it; otherwise, you will be faced with inaccurate results and data that does not match reality. In the example above, this means training the model on both daytime and nighttime footage.
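The balance check described above can be sketched in a few lines of Python. The labels and counts here are hypothetical, standing in for the security-camera example:

```python
from collections import Counter

def class_balance(labels):
    """Return each label's share of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical training set: each frame is tagged with the lighting
# condition it was captured under.
train_labels = ["night"] * 900 + ["day"] * 100

shares = class_balance(train_labels)
print(shares)  # daytime frames make up only 10% of the data

# Flag sample bias if any expected condition is badly under-represented.
for condition in ["day", "night"]:
    if shares.get(condition, 0.0) < 0.3:
        print(f"Warning: '{condition}' covers only "
              f"{shares.get(condition, 0.0):.0%} of the data")
```

A skew like this would prompt collecting more daytime footage (or re-sampling) before training.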

Exclusion Bias

This occurs when some aspects of the data are excluded from the analysis because they are considered irrelevant to the research. For example, when analysing the population's consumption of a good, one may neglect the respondent's proximity to the city centre. However, proximity to the CBD may indicate property value and median income, which can affect consumption habits.

Exclusion bias can be reduced or eliminated by:

Investigating features thoroughly before discarding them. Getting a third party to investigate can help provide a fresh perspective on the data. If time or resources are short and the dataset must be reduced by discarding features, first examine the relationship between each feature and the label; similar studies probably exist, so check whether they accounted for comparable features before deciding. In the example above, this means taking the time to analyse multiple relationships before discarding any data.
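One simple way to examine a feature–label relationship before discarding the feature is a correlation check. The sketch below uses invented survey numbers for the consumption example (distance from the CBD versus weekly spend); the 0.3 cut-off is an illustrative rule of thumb, not a fixed standard:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between a candidate feature and the label."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical survey data: distance from the CBD (km) vs. weekly spend ($).
distance_km  = [1, 2, 3, 5, 8, 12, 15, 20]
weekly_spend = [95, 90, 82, 70, 60, 45, 40, 30]

r = pearson_r(distance_km, weekly_spend)
if abs(r) > 0.3:
    print(f"r = {r:.2f}: keep the feature, it is related to the label")
else:
    print(f"r = {r:.2f}: weak relation, safer candidate for discarding")
```

Here the strong negative correlation suggests the proximity feature carries real information and should not be excluded.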

Observer Bias

This occurs when researchers' cognitive biases cause them to subconsciously influence the participants of an experiment. For instance, in the assessment of medical images, one observer might record an abnormality while another might not; different observers might tend to round a measurement up or down; and colour-change tests can be interpreted differently by different observers.

Observer bias can be reduced or eliminated by:

Ensuring that observers are well trained, that behaviours are clearly defined, and that clear rules and procedures are in place for the experiment.

Prejudice Bias

This is an unjustified or incorrect attitude (usually negative) towards an individual based solely on the individual's membership of a social group defined by, for example, appearance, class, status, gender or sexuality. An example is holding a negative attitude towards people who were not born in Australia simply because of their status as foreigners.

Prejudice Bias can be reduced or eliminated by:

Ignoring, rather than building into the analysis, statistical relationships between a group attribute and the sampled data set, for example, between gender and occupation.

Measurement Bias

A measurement process is biased if it systematically overstates or understates the true value of the measurement. This occurs when distortion arises from an issue with the device or tool used to observe or measure the data. For example, a poorly calibrated scale may consistently read too high (or too low).

Measurement Bias can be reduced or eliminated by:

Using multiple measuring devices, so that if one happens to be inaccurate there are others to compare it against, or by employing individuals trained to compare the output of the devices.
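The multiple-device comparison above can be sketched as follows. The readings are hypothetical; a device whose mean drifts from the median of all devices' means is flagged as a candidate for recalibration (the 1 g tolerance is an assumed threshold, not a standard):

```python
import statistics

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical readings of the same reference weight on three scales.
# A biased device shows a systematic offset, not just random noise.
readings = {
    "scale_A": [100.1, 99.9, 100.0, 100.2],
    "scale_B": [103.1, 102.8, 103.0, 103.2],  # consistently ~3 g high
    "scale_C": [99.8, 100.1, 99.9, 100.0],
}

# The median of the device means is robust to a single biased device.
center = statistics.median(mean(r) for r in readings.values())
offsets = {device: mean(r) - center for device, r in readings.items()}

for device, offset in offsets.items():
    flag = "  <- possible measurement bias" if abs(offset) > 1.0 else ""
    print(f"{device}: mean={mean(readings[device]):.2f}, "
          f"offset={offset:+.2f}{flag}")
```

With these numbers only scale_B is flagged, since the other two devices agree closely with each other.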

Confirmation Bias

Confirmation bias is a type of cognitive bias that involves favouring information that confirms one's previously existing beliefs or biases. For example, a person who believes that left-handed people are more creative than right-handed people will seize on whatever data confirms this belief.

Confirmation Bias can be reduced or eliminated by:

Making sure that observers are well trained to recognise when they are applying their own confirmation bias to a data set. Having multiple people analyse the data set also means that any one person's confirmation bias can be checked by others.

Outliers

An outlier is an extreme data value. Usually, it is easy to spot because the value is either very high or very low compared to the overall average. An outlier may be due to variability in the measurement, or it may indicate experimental error; the latter are sometimes excluded from the data set. It is risky to draw a conclusion without noting any anomalies involved, as doing so could produce misleading results.
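A common way to spot such extreme values is the z-score: how many standard deviations a value lies from the mean. The sketch below uses invented sales figures with one data-entry error; the threshold of 2.0 is an illustrative choice (small samples bound how extreme a z-score can get, so stricter cut-offs can miss obvious outliers):

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Flag values whose z-score exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily sales figures with one data-entry error (2050 for 205).
sales = [210, 205, 198, 215, 202, 207, 199, 2050]
print(find_outliers(sales))  # [2050]
```

Whether the flagged value is excluded or corrected should then be a deliberate, documented decision, not an automatic deletion.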

Summary
The key takeaways from this topic are as follows:

Data Science
 * The study of extracting knowledge from data

Big Data
 * Large volumes of data are collected by organisations for several purposes
 * This data is then analysed by data scientists and developed into a usable format for the business

Machine Learning
 * Develops mathematical models to make predictions on future data
 * Highly reliant on Artificial Intelligence (AI)

Data Visualisation
 * Data can be communicated visually to a viewer

Risks and Biases
 * Over-reliance on data is not a substitute for good judgement in the decision-making process
 * Bias in data sets may occur in multiple forms

Reference/s
https://learn.uq.edu.au/bbcswebdav/pid-4902574-dt-content-rid-21747911_1/courses/ECON3430S_6960_61095/Week13_Lecture_ECON3430_2019.pdf

https://towardsdatascience.com/5-types-of-bias-how-to-eliminate-them-in-your-machine-learning-project-75959af9d3a0

https://analyticstraining.com/makes-data-big/

https://cmotions.nl/en/5-typen-bias-data-analytics/

Maneth & Poulovassilis (2017). Data Science. The Computer Journal, 60(3), 285–286.

Kirkos, Spathis & Manolopoulos (2010). Audit-firm group appointment: an artificial intelligence approach. Intelligent Systems in Accounting, Finance and Management.

The Future of Life Institute (FLI) (2019). AI Policy - Japan.

Fleming (2018). In this Tokyo cafe, the waiters are robots operated remotely by people with disabilities. World Economic Forum.