Big Data


The increasing use of technology in society is generating such a large volume of data that managing it has become a highly complex problem.


What Is Big Data?

Big Data describes the massive volume of both structured and unstructured data that is difficult to process using traditional databases and software techniques. - Vangie Beal

Examples

Just from your web browser alone, data on every action you take can be collected. Here are two ways you can view the data that might be collected and used.


1. View Browser History

Mac: ⌘ + Y , Win: Ctrl + H

Every webpage you visit while browsing the internet is recorded here, along with the time you visited it (unless you're in private browsing mode).

2. View Browser Console

Mac: ⌘ + Option + J , Win: Ctrl + Shift + J

Think of the browser console as the browser's equivalent of the Terminal application on your computer. Any activity on a page can be logged to the browser console, and what is logged can then be stored. Keep it open whilst browsing this website to see every action you take unfold in real time. (Expand a log entry to see the metadata attached to each event.)

Problems of Big Data

Technical Problems:

The amount of data is so immense that traditional strategies in data processing are no longer effective and cause problems in the following areas:

Analysis

Data analysis is the process of investigating datasets in order to discover useful information which can support decision-making. Data analysis can lead to descriptive and predictive findings. Analysing data is a complex activity which faces barriers, such as data scientists confusing facts with opinions or being affected by cognitive bias. As the size of data grows, so does the complexity of analysing it.
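To make the difference between descriptive and predictive findings concrete, here is a minimal sketch. The daily visit counts are invented for illustration, and a straight-line fit is only one assumed way of producing a prediction.

```python
from statistics import mean

# Hypothetical data: pages visited per day over one week (invented for illustration).
days = [1, 2, 3, 4, 5, 6, 7]
visits = [12, 15, 14, 18, 21, 20, 24]


def describe(values):
    """Descriptive finding: summarise what has already happened."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def fit_line(xs, ys):
    """Least-squares straight line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def predict(xs, ys, x_new):
    """Predictive finding: extend the fitted trend to a day not yet observed."""
    slope, intercept = fit_line(xs, ys)
    return slope * x_new + intercept


summary = describe(visits)
forecast = predict(days, visits, 8)
print(summary, round(forecast, 1))
```

The summary describes the week that was recorded, while the forecast is a guess about day 8 that is only as good as the assumption that the trend continues.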

Capture

Data can be captured from multiple sources. These sources may be technological products and services, documents, webpages or even sensors throughout the world, such as speed cameras, wearable technologies and satellites. Data capture is now so prevalent in our society that it has given rise to Big Data.

Search

Searching through data is an arduous task which involves finding the information that meets a particular need within a data set. Google's search engine does a good job of searching through webpages; however, how do we solve the problem of searching through large amounts of structured and unstructured data?
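One widely used building block for text search is an inverted index, which maps each word to the documents that contain it so that a query does not have to scan every document. The sketch below is a toy version under simple assumptions (whitespace tokenisation, exact word matches); the three documents are invented for illustration.

```python
from collections import defaultdict

# Hypothetical mini-collection of documents (invented for illustration).
documents = {
    1: "speed cameras capture data",
    2: "wearable technologies capture health data",
    3: "satellites capture images",
}


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(documents)
print(sorted(search(index, "capture data")))
```

Real search systems add ranking, stemming and distributed storage on top of this idea, which is where much of the Big Data difficulty lies.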

Sharing

Data Sharing is the ability to utilise data owned by others. It may involve the sharing of business, academic or scientific data. The ethical decision of whether and how to share data usually rests with the owner. The sharing of data further adds to the complexity of Big Data.

Storage

As the size of data grows, so must the capability to store it. Storage technologies such as tapes, CDs, USB drives and hard drives have been invented, and they continue to become physically smaller while holding larger amounts of data. Large amounts of data required the creation of these new technologies, which are becoming more affordable as technology advances.

Transfer

Data transfer is the ability to send digital information from one computer to another. Inventions such as copper wires, optical fibres and wireless technologies have enabled people to send data through networked systems. Big Data is faced with the need for technology to quickly and securely send large amounts of data across network channels.

Visualisation

The ability to visualise data clearly and concisely is a powerful tool of communication. Visualisation can be used to extract meaning from large amounts of data, thereby unravelling its complexity. This topic is elaborated further in the visualisation course theme.

Privacy

The use of IT devices, products and services generates personal data which is collected and processed by organisations. Data on health, location, finances and other personal information is digitised and stored. This topic is further elaborated in the opinion editorial and also in the privacy and ignorance course themes.

Security

Data can never be 100% secure. Securing data is becoming more difficult as more data is generated by businesses. The main motivations for stealing data are to commit corporate espionage, to profile high-ranking individuals and to steal financial credentials. To mitigate these risks, organisations need to secure their data, which is a constantly changing problem.

Benefits of Big Data

The concept of Big Data also provides benefits from multiple perspectives.

Cost Reductions

a business perspective

Utilising big data technologies such as Hadoop and Cloud Computing can save businesses substantial amounts of money. Rather than maintaining data warehouses, servers and infrastructure themselves, businesses are able to offload this burden to service providers. Advantages of such service models (for example, Software as a Service and Infrastructure as a Service) include the scalability and flexibility of infrastructure as business data grows.

Improved Decision Making

a business perspective

The analysis of large amounts of data enables businesses to make better business decisions. Further, these decisions can enable the optimisation of business processes, which in turn enables businesses to become more profitable. One example of a business utilising new sources of data is the health insurance company United Healthcare, which "uses natural language processing tools from the Software Analytics System (SAS) to better understand customer satisfaction. By converting records of customer voice calls into text, the company is able to search for indications that the customer is dissatisfied. The company has already found that the text analysis improves its predictive capability for customer attrition models". (Source)
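The idea of searching call transcripts for signs of dissatisfaction can be sketched very roughly as follows. This is not the method described in the quoted source, which uses natural language processing tools; it is a simple keyword count, and the phrases, call ids and transcripts are all hypothetical.

```python
# Hypothetical phrases and transcripts, invented for illustration only.
DISSATISFACTION_PHRASES = ("cancel", "complaint", "not happy", "switch provider")

transcripts = {
    "call-001": "I would like to update my address please",
    "call-002": "I am not happy with this claim and I want to cancel",
    "call-003": "Thanks, that answered my question",
}


def dissatisfaction_score(text):
    """Count how many of the watched phrases appear in one transcript."""
    lowered = text.lower()
    return sum(1 for phrase in DISSATISFACTION_PHRASES if phrase in lowered)


def flag_calls(calls, threshold=1):
    """Return the ids of calls whose score reaches the threshold."""
    return [
        call_id
        for call_id, text in calls.items()
        if dissatisfaction_score(text) >= threshold
    ]


flagged = flag_calls(transcripts)
print(flagged)
```

A score like this could be one input into a customer attrition model, although a real system would need to handle negation, context and transcription errors.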

New and Improved Products and Services

a business perspective

The analysis of data on user behaviour in existing systems enables businesses to create new and improved products and services. Similar to the iteration loop discussed in the Evolution course theme, businesses are able to iteratively create better products by utilising data findings. The increased consumption and analysis of data results in a better understanding of user behaviour, which in turn enables businesses to create more valuable products and services.

Targeted Advertising

a business perspective

Another benefit of analysing data on user behaviour is the ability to deduce consumer interests. By knowing what consumers like and dislike, and having a statistical understanding of the relationships between different products, businesses are able to create advertisements which take advantage of those interests and therefore result in more sales and profitability for the business.
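A very simple form of the "statistical understanding of the relationships between different products" mentioned above is counting which products are bought together. The sketch below assumes hypothetical shopping baskets and suggests the product most often bought alongside a given one; it is an illustration, not a description of any particular company's system.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (invented for illustration).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]


def pair_counts(all_baskets):
    """Count how often each pair of products is bought together."""
    counts = Counter()
    for basket in all_baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def suggest(product, all_baskets):
    """Return the product most often bought alongside `product`, or None."""
    related = Counter()
    for (first, second), count in pair_counts(all_baskets).items():
        if first == product:
            related[second] += count
        elif second == product:
            related[first] += count
    return related.most_common(1)[0][0] if related else None


print(suggest("bread", baskets))
```

An advertiser could use a suggestion like this to decide which product to promote to someone who has just bought bread.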

Mitigation of Risk

a business perspective

Data analysis can result in the mitigation of risk to the business. By developing a better understanding of the business's processes through data, improved decisions can be made regarding risk management and the development of risk mitigation strategies.

Job Opportunities

an interdisciplinary perspective

Businesses are beginning to realise the importance of Big Data and of the ability to analyse data to make more informed decisions. As a result, there is high demand for specific skillsets, which leads to the creation of employment opportunities for more people. Dhiraj Rajaram believes that an interdisciplinary skillset of business, mathematics, technology and behavioural sciences characterises an ideal employee for data-driven organisations. The growth of Big Data combined with the current skill shortage in this area will result in an increase in employment opportunities for individuals in the future. (Source)

The Development and Progression of Technology

an engineering perspective

As the size of data grows, the need for technology to solve the problems associated with Big Data also increases. Technologies such as Hadoop, Cloud Computing, Hive and Spark have all been developed in an attempt to solve the problems associated with the security, storage and analysis of Big Data. Innovative developments will continue to occur in the future.

Improved Artificial Intelligence

an engineering perspective

Artificial Intelligence (AI) is the "theory and development of computer systems able to perform tasks normally requiring human intelligence, such as, visual perception, speech recognition, decision-making, and translation between languages". (Source) From a technological perspective, AI is seen as an opportunity to understand and solve problems outside the realms of human capacity. A recent example is the AlphaGo AI, which beat a human player at the Chinese board game Go. The AI is able to learn from watching thousands of Go games (large amounts of data) in order to reach an optimal method of playing the game. Before this, such a feat was not expected to be possible until around 2020. (Source) Further, the invention of autonomous self-driving cars also utilises AI. As AI is able to learn from and process large amounts of collected data, perhaps computers and robots will be able to perform the majority of human tasks. This would then give humans the ability to focus more time on what they find enjoyable.

Course Themes


Identity

During a recent police investigation, Chief Inspector Stone was interviewing five local villains to try and identify who stole Mrs Archer's cake from the mid-summers fair. Below is a summary of their statements:

Arnold:
it wasn't Edward
it was Brian

Brian:
it wasn't Charles
it wasn't Edward

Charles:
it was Edward
it wasn't Arnold

Derek:
it was Charles
it was Brian

Edward:
it was Derek
it wasn't Arnold

It was well known that each suspect told exactly one lie. Can you determine who stole the cake?
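For readers who would rather let a computer do the deduction, the puzzle can be checked by brute force: assume each suspect in turn is the thief and count how many lies every suspect would then have told. The sketch below assumes exactly one of the five suspects stole the cake; the data structure and function names are my own.

```python
SUSPECTS = ["Arnold", "Brian", "Charles", "Derek", "Edward"]

# Each statement is (suspect named, claimed guilty?), taken from the summary above.
STATEMENTS = {
    "Arnold": [("Edward", False), ("Brian", True)],
    "Brian": [("Charles", False), ("Edward", False)],
    "Charles": [("Edward", True), ("Arnold", False)],
    "Derek": [("Charles", True), ("Brian", True)],
    "Edward": [("Derek", True), ("Arnold", False)],
}


def lies_told(speaker, thief):
    """Count how many of the speaker's statements are false if `thief` did it."""
    return sum(
        (named == thief) != claimed_guilty
        for named, claimed_guilty in STATEMENTS[speaker]
    )


def find_thieves():
    """Return every suspect consistent with 'each suspect told exactly one lie'."""
    return [
        thief
        for thief in SUSPECTS
        if all(lies_told(speaker, thief) == 1 for speaker in SUSPECTS)
    ]


print(find_thieves())
```

Running the script prints the list of suspects that satisfy the rule, so try the puzzle by hand first if you do not want the answer spoiled.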

Source

Identity Part 1

Hopefully, the previous slide shows how collected information and logical deduction can identify human behaviour, and thereby illustrates how organisations and governments utilise collected data.


The opinion editorial discussed the flaws associated with the concept of identity in the digital world. People have the ability to take on any identity they desire, or even multiple identities, in the digital world. This is illustrated in the use of avatars, chat sites, dating sites, forums and online gaming to name a few.


As people use more and more connected digital devices, such as phones, iPads and tracking devices, organisations are collecting more and more personal information on users' identities and personal behaviour. As this personal information is collected and stored, it becomes valuable big data, which then becomes a major target for hackers. All online users and shoppers are vulnerable to fraud and identity theft. Hackers use sophisticated methods to obtain personal information and steal user identities, mainly for personal financial gain. Organisations attempt to mitigate the risk of attacks but always seem to lag behind in their attempts to defend against hackers.


Organisations attempt to anonymise the collected personal data in order to protect the privacy and sensitivity of the information being utilised. However, encrypting, coding and even removing personal information, such as names and credit card numbers, does not guarantee the anonymity of information in data sets. Many other identifiers can be used to re-identify a user and reveal their personal identity: dimensions such as location, transactions, store names, item prices and age can all be used. During the guest panel, Professor Joan Beaumont discussed the process of identity construction. She stated that the process involves answering the questions of Who, When, Where and Why. This process can be applied to the dimensions mentioned above to construct identities and de-anonymise individuals in datasets. Anonymity is a myth in the digital age due to the availability and publication of multiple data sets generated by consumers using multiple products and services. Data scientists have been able to re-identify users from anonymised data sets, such as those from Netflix, credit card transactions and the Genome Project. Therefore, Big Data can never uphold complete anonymisation; it merely contains de-identified information.
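The re-identification idea can be sketched as a simple join between an "anonymised" data set and a second, public data set on the dimensions they share. Everything below is invented for illustration: the records, the names and the choice of postcode and age as the shared dimensions are all assumptions.

```python
# Both data sets are invented for illustration; they contain no real people.
anonymised_purchases = [
    {"user": "u1", "postcode": "2600", "age": 34, "store": "Bookshop"},
    {"user": "u2", "postcode": "2601", "age": 52, "store": "Pharmacy"},
    {"user": "u3", "postcode": "2600", "age": 34, "store": "Cafe"},
]

public_profiles = [
    {"name": "Alex Example", "postcode": "2601", "age": 52},
    {"name": "Sam Sample", "postcode": "2600", "age": 34},
    {"name": "Kim Placeholder", "postcode": "2600", "age": 34},
]

# Dimensions present in both data sets that are not names but still narrow down identity.
QUASI_IDENTIFIERS = ("postcode", "age")


def key(record):
    """Build the linking key from the shared dimensions."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymous, public):
    """Link a pseudonymous user to a name when exactly one public profile matches."""
    names_by_key = {}
    for profile in public:
        names_by_key.setdefault(key(profile), []).append(profile["name"])
    matches = {}
    for row in anonymous:
        candidates = names_by_key.get(key(row), [])
        if len(candidates) == 1:
            matches[row["user"]] = candidates[0]
    return matches


print(reidentify(anonymised_purchases, public_profiles))
```

In this toy data only one user is uniquely matched; the other two remain ambiguous until another shared dimension, such as a store name or transaction time, is added to the key.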


Identity Part 2

Following the continuous study of the complexity of Big Data throughout the semester, it was interesting to note an article titled 'I would not trust ABS with my personal data, former employee says' in The Canberra Times, dated 21 May 2016. The article stated that the Australian Bureau of Statistics (ABS) wants to retain personal identifying information from the 2016 Census of Population and Housing, which won't be overwritten or destroyed until 2020. Since 1961, all identifying information had been destroyed once the important data had been saved. This scenario poses a privacy and security issue, but the ABS is adamant that they will not share any 'identifiable, private or confidential data'. However, I do not believe that this information will remain unshared between government agencies. Further, the '110 year old legal safeguards' mentioned seem outdated and may not be effective in securing the personal data.


My thoughts are that a lot of effort is required to de-anonymise specific datasets, but that it is possible and is being done, as referenced above. As big data grows, more research and resources will be needed to anonymise the mounds of personal information within data sets, and no doubt better ways to re-identify users will also appear. I believe that policies and strategies to deal with identity issues should be further developed and implemented within organisations in order to protect the identity of users and consumers.


References:

Your Genome Could Reveal Your Identity
Your identity is Not Anonymous
Re-identifying Anonymous People with Big Data

Steps to Reduce Online Surveillance:

  1. Download and install Adblock
  2. Don't connect to public Wi-Fi networks
  3. Be careful of malware and phishing attempts
  4. Turn off your Wi-Fi
  5. Turn off your mobile
  6. Don't use wearable technology
  7. Turn off your computer
  8. Don't ever use a computer again.

Ignorance

Have you ever seen messages such as these in your web browser?
Go on, click them...

Ignorance Part 1

Hopefully, the previous slide illustrates the targeted marketing that companies use as a strategy for selling products and services. By collecting personal data on individuals, companies are able to infer each individual's interests and target their advertising accordingly.


Ignorance is defined as a lack of knowledge or information. Living in an era of Big Data gives rise to the collection and storage of vast amounts of personal information on every individual. As the collection of data is organised and analysed, more information is produced, and re-combinations of data generate further new information and knowledge. This results in information overload and in ignorance, as "our capacity to manage and comprehend information is not keeping pace with [the] seemingly exponential growth" of Big Data. (Source: Ignorance, Forgetting and Unlearning. pg 101)


The opinion editorial illustrated the ignorant act of people registering for services that collect their personal information, and also discussed the ignorant acts of other people posting and providing random people's personal information and the damaging effects that those acts cause.

Ignorance Part 2

Modern technology and social media have resulted in the collection of large volumes of data on individuals, which can greatly affect their daily lives, from credit ratings to job opportunities. However, most people are ignorant of the type and amount of information that is available to profile them. We are ignorant of the fact that when we buy our groceries and pay by credit card, or use loyalty cards like Flybuys, information is being collected about us. Not only do businesses know where we shop and what we buy, but they are also producing information about our eating habits and our lifestyles. Therefore, the saying "ignorance is bliss" holds true for the majority of individuals in relation to the type and amount of personal information collected on them by businesses. However, this saying is not true from the perspective of businesses, as businesses need the right information at the right time in order to target consumers, facilitate decision-making and stay competitive. Further, most people, even some CEOs of organisations, are ignorant of how big data is collected, stored and processed, and of the technologies used to process big data, such as Hadoop, Hive and Spark.


Lastly, a large number of individuals are concerned about privacy issues relating to their personal information. However, as consumers value the benefits of using products and services, they become less concerned about the privacy issues. Individuals are therefore deliberately choosing to remain ignorant of the ways their personal information is being utilised as the perceived value and benefits of products and services outweigh their privacy concerns. (Source)


I believe that the ignorance of consumers benefits corporations who want to hide the way they conduct their business and do not want to disclose strategies to competitors. Another reason for consumer ignorance is that "people can't handle the truth": by letting the general public remain ignorant, businesses will encounter less resistance to how their business is being conducted. Therefore, it is in companies' best interest to continue hiding business methods and strategies and keeping consumers ignorant.

Systems Thinking

The principles of Systems Thinking are an essential tool in assessing Big Data, due to the vast volume of data and the difficulty of analysing and utilising it.


By utilising a Systems Approach to Big Data, we can separate Big Data into components in order to understand its degrees of complexity. The volume of data is a small issue; the relationships between data are a bigger issue; and the biggest issue is the rate of interconnection between the volume, variety and ambiguity of data.


Systems Thinking can be effectively applied to Big Data to devise different ways to process Big Data, develop theories and models, and to define algorithms needed to analyse Big Data. Breaking systems down into classes, subclasses, elements and relationships enables a better understanding of how the entire system works, and provides insight and clarity into solutions. A Systems Thinking approach will enable organisations to find true value and benefit from the immense volumes of collected data from multiple sources.


Upon reflecting on the topic of Systems Thinking, I believe that it is an invaluable tool for data scientists in finding ways to analyse and utilise Big Data. However, the volume of Big Data is growing at such a rapid pace that there will be a massive shortage of people with data analytics skills in the future. Business organisations will therefore benefit from using a Systems Thinking approach to train all permanent employees in data analytics, rather than merely relying on contracted data scientists. Organisations will be more productive and competitive by training every employee in how to collect, disseminate, compare and utilise data.


References:

Harnessing Big Data with a Systems Thinking Approach - (A Harley Davidson Case Study)
The Correlation between Big Data and Systems Thinking
Data Analytics - Systems Thinking

Visualisation

Visualisation Part 1

Hopefully, the previous slide highlights the various methods of interpreting the same dataset and illustrates four existing visualisation techniques: the scatter plot, bar chart, radar chart and pie chart.


The two advantages of visualising data are, firstly, the ability to understand meaning from data and, secondly, the ability to communicate this understanding to another person. With the invention of the two-dimensional coordinate system by René Descartes in the 17th century, humans were able to visualise information in a way that was intuitive, clear, accurate and efficient. (Source) Experimenting with visualisation gave rise to data visualisation techniques such as the bar graph, histogram, pie chart and scatter plot. Further, even though Big Data extends far beyond two dimensions, these techniques can still be applied to achieve an understanding of large sets of data.


Guest speaker Julie Brooke spoke about her use of art as a medium for portraying and visualising her emotions. I find it interesting that the visualisation of data is an attempt to do the opposite - to take a set of metrics and create emotion and meaning from them. Personally, I feel that visualisation is extremely important in the way we approach Big Data, especially as a way of communicating the results of data analysis. I feel that a strong emphasis should be placed on how the data is aesthetically presented to the viewer. When data is well presented, viewers might subconsciously become more familiar with a visualisation, which might in turn result in a better understanding. Achieving an aesthetic visualisation of data requires the cooperation of data scientists, statisticians, graphic designers, artists and software engineers. Human Computer Interaction is the field of research into how humans interact with computers. Perhaps research into how humans perceive colours, fonts and sizes of elements on a screen might inform data visualisation in the future.


Data visualisation as a method of understanding large amounts of data has resulted in the invention of multiple digital and graphical technologies such as D3.js, the Statistical Analysis System (SAS) and R (which is a tool I personally used in COMP3420 - Advanced Databases and Data Mining). Many more open source technologies exist and are used to visualise structured or unstructured data. Unique visualisation tools even exist for visualising social networks such as Facebook (pictured on the next slide).

Visualisation Part 2

Network Visualisation of Facebook Users around the world.


Data visualisation is an important tool which enables organisations to make correct business decisions and improve business processes. Effective visualisation of data can lead to uncovering data trends and provide previously unseen insights. I feel that data visualisation is the bridge between having data and effectively understanding data.

Abstract Complexity

Dimensions

Joan Licata's slides on dimensions indicated that measurements can be viewed in one, two or even three dimensions. However, if you thought that was difficult to contemplate, with Big Data the number of dimensions can be any natural number, even in the thousands, as taught in COMP3420 - Advanced Databases and Data Mining. Data warehouses (DWs) are central repositories of integrated data from one or more disparate sources. They store current and historical data and are used to create analytical reports for business professionals. (Wikipedia) A well-known approach to storing data in data warehouses is the Star Schema, or Dimensional Model. In a dimensional model, transaction data is partitioned into "facts" and "dimensions", where the dimensions reference information that gives context to the facts. (Source)

The star schema is an important method of storing data in DWs as it can decrease the query time of finding information within the DW. This shows how choosing dimensions that best describe data is a successful method of comprehending and communicating data from large, disparate data sets. Further, the effort of analysing Big Data is extremely dependent on the dimensions chosen. The chosen dimensions only show a small part of the big picture, and may completely ignore insight that could be generated from other dimensions. According to the inventor of the Star Schema, business analytics were easier to communicate when using dimensions, which enabled business-focussed individuals to better understand the data in front of them.
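A tiny star schema can be sketched with Python's built-in sqlite3 module. The table names, column names and sales figures below are hypothetical; the point is only to show a central fact table whose rows reference dimension tables that give the facts their context.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables give context to the facts.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
    -- The fact table holds the measurements and points at each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id INTEGER REFERENCES dim_store(store_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Milk'), (2, 'Bread');
    INSERT INTO dim_store VALUES (1, 'Canberra'), (2, 'Sydney');
    INSERT INTO fact_sales VALUES (1, 1, 3.0), (2, 1, 4.5), (1, 2, 3.0), (2, 2, 5.0);
    """
)


def sales_by_city(connection):
    """Total sales per city: one join from the fact table to one dimension."""
    rows = connection.execute(
        """
        SELECT s.city, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_store AS s ON s.store_id = f.store_id
        GROUP BY s.city
        ORDER BY s.city
        """
    )
    return dict(rows.fetchall())


print(sales_by_city(conn))
```

Swapping dim_store for dim_product in the join would answer a different question from the same facts, which illustrates how the chosen dimension shapes the insight that is found.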

Dimensions Describing Big Data

Volume, Velocity and Variety


The three dimensions that best describe Big Data:


Volume - the amount of data
Velocity - the speed at which data is generated
Variety - the types of data and combinations of data formats and schemas, e.g. text, images, structured and unstructured

This illustrates that dimensions not only have an effect on communicating and understanding large data sets but can also be a useful method of communicating and understanding the concept of big data itself!

Economic Complexity

Economic complexity relates to a country's trade, products produced, growth opportunities, analysis of economic growth and the development of countries. Big data can increase our knowledge immensely and can provide useful information in many fields, including economic complexity. It can provide data on resources and products, forecast GDP growth, determine the economic performance of countries and predict their future development. However, the amount of data collected is overwhelming and has usually been collected for different purposes. No single dataset will provide all the valuable information and meaningful answers to our complex questions. The task of extracting meaningful and useful information from multiple datasets requires complex networks, new algorithms and the development of big data visualisation engines. It is a time-consuming process which requires scientific analysis and is very labour intensive. However, not only does big data provide employment opportunities for technical staff, such as data scientists, software engineers and data processors, but it also provides useful information on Economic Complexity to investors, entrepreneurs, policymakers and the public.


Reflecting on the topic of Economic Complexity brings back recollections of guest speaker Rob Bray's presentation on New Income Management (NIM) in the Northern Territory. This program was introduced in 2010 in order to improve the lives of Aboriginal people and to protect Aboriginal children. Long-term unemployed Aboriginal people were issued with BasicsCards, which could be used to purchase necessary basic commodities but not alcohol, tobacco, gambling or pornography, nor for cash withdrawals. The collection and analysis of the data showed that the program failed to address any fundamental problems or to economically improve the lives of the targeted group. This example highlighted the use of big data to obtain meaningful socio-economic information, but it also showed that data does not always provide the right answers to complex problems.


References: Luciano Pietronero, Rob Bray Panel Slides

The End

Created by Daniel Pekevski

Powered by Bootstrap, OnePage, Chart.js and Arbor.js