This post discusses accountability, ethics and professionalism in data science (DS) practice, considering the demands and challenges practitioners face. Dramatic increases in the volume of data captured from people and things, and in the ability to process it, place data scientists in high demand. Business executives hold high hopes for the new and exciting opportunities DS can bring to their business, and hype and mysticism abound. Meanwhile, the public are increasingly wary of trusting businesses with their personal data, and governments are implementing new regulation to protect public interests. This post asks whether some form of professional ethics can protect data scientists from unrealistic expectations and far-reaching accountabilities.
Demand for DS skills is off the charts, as Data Scientists have the potential to unlock the promise of Big Data and Artificial Intelligence.
As much of our lives are conducted online, and everyday objects are connected to the internet, the “era of Big Data has begun” (boyd & Crawford 2012). Advancements in computing power and cheap cloud services mean that vast amounts of digital data are tracked, stored and shared for analysis (boyd & Crawford 2012), and there is a process of “datafication” as this analysis feeds back into people’s lives (Beer 2017).
Concurrently, Artificial Intelligence (AI) is gaining traction through successful use of statistical machine learning and deep learning neural networks for image recognition, natural language processing, games, and question-and-answer dialogue (Elish & boyd 2017). AI now permeates every aspect of our lives in chatbots, robotics, search and recommendation services, automated voice assistants and self-driving cars.
Data is the new oil, and Google, Amazon, Facebook and Apple (GAFA) are in control of vast amounts of it. Combined with their network power, this results in supernormal profits: US$25bn net profit amongst them in the first quarter of 2017 alone (the Economist 2017). Tesla, which made 20,000 self-driving cars in this time, is worth more than GM, which sold 2.5m (the Economist 2017).
Furthermore, traditional industries such as government, education, healthcare, financial services, insurance and retail, and functions such as accounting, marketing, commercial analysis and research, which have long used statistical modelling and analysis in decision making, are harnessing the power of Big Data and AI, which supplements or replaces “complex decision support in professional settings” (Elish & boyd 2017).
All these factors drive incredible demand from organisations, and result in a shortage of supply of data scientists.
With this incredible appetite for and supply of personal data, individuals, governments and regulators are increasingly concerned about threats to competition (globally), personal privacy and discrimination, as DS, algorithms and big data are neither objective nor neutral (Beer 2017; Goodman & Flaxman 2016). These must be understood as socio-technical concepts (Elish & boyd 2017), and their limitations and shortcomings well understood and mitigated.
To begin with, the process of summarizing humans into zeros and ones removes context. Therefore, contrary to popular mythology about Big Data, the larger the data set, the harder it is to know what you are measuring (Theresa Anderson n.d.; Elish & boyd 2017). Rather, the DS practitioner has to decide what is observed, what is recorded, what is included in the model, how the results are interpreted, and how to describe the model’s limitations (Elish & boyd 2017; Theresa Anderson n.d.).
“All too often, limitations in the data mean that cultural biases and unsound logics get reinforced and scaled by systems in which spectacle is prioritised over careful consideration.” (Elish & boyd 2017)
In addition, profiling is inherently discriminatory, as algorithms sort, order, prioritise, and allocate resources in ways that can “create, maintain or cement norms and notions of abnormality” (Beer 2017; Goodman & Flaxman 2016). Statistical machine learning scales normative logic (Elish & boyd 2017), and biased data in means biased data out, even when protected attributes are excluded, as long as attributes correlated with them are included. Systems are not optimised to be unbiased; rather, the objective is to have better average accuracy than the benchmark (Merity 2016).
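To make the point about correlated attributes concrete, here is a minimal sketch in Python on synthetic data. Everything in it is an assumption made for illustration: the two groups, the `postcode` proxy, the record counts and the majority-vote "model" are hypothetical and are not taken from Merity (2016) or any other source cited here. The rule never reads group membership, yet it can still reproduce the historical gap between the groups, because postcode stands in for it.

```python
from collections import defaultdict

# Hypothetical historical records: (group, postcode, approved).
# "group" is the protected attribute; "postcode" is a correlated proxy.
records = (
    [("A", "north", 1)] * 80 + [("A", "north", 0)] * 10
    + [("A", "south", 1)] * 5 + [("A", "south", 0)] * 5
    + [("B", "south", 1)] * 20 + [("B", "south", 0)] * 70
    + [("B", "north", 1)] * 8 + [("B", "north", 0)] * 2
)


def fit_majority_rule(rows):
    """Learn the majority historical outcome per postcode.

    The protected attribute is deliberately ignored during fitting.
    """
    counts = defaultdict(lambda: [0, 0])  # postcode -> [denied, approved]
    for _group, postcode, approved in rows:
        counts[postcode][approved] += 1
    return {pc: int(c[1] >= c[0]) for pc, c in counts.items()}


def approval_rate_by_group(rows, rule):
    """Apply the postcode-only rule and report approval rates per group."""
    totals = defaultdict(int)
    approvals = defaultdict(int)
    for group, postcode, _approved in rows:
        totals[group] += 1
        approvals[group] += rule[postcode]
    return {g: approvals[g] / totals[g] for g in totals}


rule = fit_majority_rule(records)
rates = approval_rate_by_group(records, rule)
print(rule)
print(rates)
```

On this toy data the rule approves everyone in the north and no one in the south, so group A, which mostly lives in the north, is approved far more often than group B, even though group was never an input.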
Lastly, algorithms by their statistical nature are risk averse, and focus where they have a greater degree of confidence (Elish & boyd 2017; Theresa Anderson n.d.; Goodman & Flaxman 2016), exacerbating the underrepresentation of minorities that exists in unbalanced training data (Merity 2016).
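The same concern can be seen in how results are reported. The sketch below, again in Python on made-up numbers, assumes a hypothetical classifier evaluated on 950 majority-group and 50 minority-group examples, and a hypothetical benchmark accuracy of 0.90. None of these figures come from the cited sources. It shows how a headline average can beat the benchmark while the minority group is served no better than a coin toss.

```python
# Hypothetical evaluation set: (group, true_label, predicted_label).
evaluation = (
    [("majority", 1, 1)] * 931 + [("majority", 1, 0)] * 19
    + [("minority", 1, 1)] * 25 + [("minority", 1, 0)] * 25
)

BENCHMARK = 0.90  # assumed benchmark accuracy, for illustration only


def accuracy(rows):
    """Fraction of rows where the prediction matches the true label."""
    return sum(1 for _g, truth, pred in rows if truth == pred) / len(rows)


def accuracy_by_group(rows):
    """Accuracy computed separately for each group present in the rows."""
    groups = {g for g, _t, _p in rows}
    return {g: accuracy([r for r in rows if r[0] == g]) for g in groups}


overall = accuracy(evaluation)
per_group = accuracy_by_group(evaluation)

print(f"overall: {overall:.3f} (beats benchmark: {overall > BENCHMARK})")
for group, acc in sorted(per_group.items()):
    print(f"{group}: {acc:.3f}")
```

Reporting accuracy per group alongside the average is one simple way for a practitioner to describe a model's limitations rather than hide them.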
In response, the European Union announced an overhaul of its data protection regime, from a Directive to the far-reaching General Data Protection Regulation. Slated to take effect in May 2018, this regulation protects the rights of individuals, including citizens’ right to be forgotten and to have their data stored securely, and also the right to an explanation of algorithmic decisions that significantly affect an individual (Goodman & Flaxman 2016). The regulations prohibit decisions made entirely by automated profiling and processing, and will impose significant fines for non-compliance.
Indeed, companies are currently reorganising themselves to protect the data assets they are amassing, reflecting the increased need for data security, ethics and accountability. Two recent additions to the Executive suite are the Chief Information Security Officer and the Chief Data Officer, who are responsible for ensuring organisations meet their legal obligations for data security and privacy.
DS practitioners must overcome many challenges to meet these demands for accountability and profit. It all boils down to ethics. Data scientists must identify and weigh up the potential consequences of their actions for all stakeholders, and evaluate their possible courses of action against their view of ethics or right conduct (Floridi & Taddeo 2016).
Algorithms are machine learning, not magic (Merity 2016), but the media and senior executives seem to have blind faith, and regularly use “magic” and “AI” in the same sentence (Elish & boyd 2017).
In order to earn the trust of businesses and act ethically towards the public, practitioners must close the expectation gap generated by recent successful (but highly controlled) “experiments-as-performances”, by being very clear about the limitations of their DS practices. Otherwise DS will be snake oil, and collapse under the weight of the hype and these unmet expectations (Elish & boyd 2017), or breach regulatory requirements and lose public trust trying to meet them.
The accountability challenge is compounded in multi-agent, distributed global data supply chains, where accountability and control are hard to assign and assert (Leonelli 2016). The data may not have been “cooked with care”, and its provenance and the assumptions embedded in it are often unknown (Elish & boyd 2017; Theresa Anderson n.d.).
Furthermore, cutting-edge DS is not a science in the traditional sense (Elish & boyd 2017), where hypotheses are stated and tested using the scientific method. Often, it really is a black box (Winner 1993), where the workings of the machine are unknown, and hacks and shortcuts are made to improve performance without really knowing why these work (Sutskever, Vinyals & Le 2014).
This makes explaining the algorithmic process and its results to a human almost impossible in some networks (Beer 2017).
Lastly, the social and technical infrastructure grows quickly around algorithms once they are out in the wild. With algorithms powering self-driving cars and air traffic collision avoidance systems, ignoring the socio-technical context can have catastrophic results. The Überlingen crash in 2002 occurred because there was limited training on what controllers should do when they disagreed with the algorithm (Ally Batley 2017; Wikipedia n.d.). Data scientists have limited time and influence to get the socio-technical setting optimised before order and inertia set in, but the good news is that the time is now, whilst the technology is new (Winner 1980).
Indeed, the opportunities to use DS and AI for the betterment of society are vast. If data scientists embrace the uncertainty and the humanity in the data, they can make space for human creative intelligence, whilst at the same time respecting those who contributed the data, and hopefully create some real magic (Theresa Anderson n.d.).
So how can DS practitioners equip themselves to take on these challenges and opportunities ethically?
Historically, many other professions have formed professional bodies to provide support outside of the influence of the professional’s employer. The members sign codes of ethics and professional conduct, in vocations as diverse as designers, doctors and accountants (The Academy of design professionals 2012; Australian Medical Association 2006; CAANZ n.d.).
Should DS practitioners follow this trend?
“A profession is a disciplined group of individuals who adhere to ethical standards and who hold themselves out as, and are accepted by the public as possessing special knowledge and skills in a widely recognised body of learning derived from research, education and training at a high level, and who are prepared to apply this knowledge and exercise these skills in the interest of others. It is inherent in the definition of a profession that a code of ethics governs the activities of each profession. Such codes require behaviour and practice beyond the personal moral obligations of an individual. They define and demand high standards of behaviour in respect to the services provided to the public and in dealing with professional colleagues. Further, these codes are enforced by the profession and are acknowledged and accepted by the community.” (Professions Australia n.d.)
The central component in every definition of a profession is ethics and altruism (Professions Australia n.d.), therefore it is worth exploring professional membership further as a tool for data science practitioners.
Current state of DS compared to accounting profession
Let us compare where the nascent DS practice is today with the chartered accountant (CA) profession. The first CA membership body was formed in 1854 in Scotland (Wikipedia 2017a), long after double entry accounting was invented in the 13th century (Wikipedia 2017b). Modern data science began in the mid twentieth century (Foote 2016), and there is as yet no professional membership body.
The current CA membership growth rate is unknown, but DS practitioner growth is impressive. In 2016, there were 2.1M licensed chartered accountants (Codd 2017), not including unlicensed practitioners such as bookkeepers, or Certified Practising Accountants. IBM predicts there will be 2.7M data scientists by 2020 (Columbus n.d.; IBM Analytics 2017), predicting 15% growth annually.
The standard of education is very high in both professions, but for different reasons. Chartered Accountants must pass strenuous postgraduate exams to apply for membership, and meet requirements for continuing professional education (CAANZ n.d.).
DS entry levels are high too, but enforced by competitive forces only. Right now, 39% of DS job openings require a Masters or PhD (IBM Analytics 2017), but this may change over time as more and more data scientists are educated outside of universities.
The CA code of ethics is very stringent, requiring high standards of ethical behaviour and outlining rules, and membership can be revoked if the rules are broken (CAANZ n.d.). CAs must treat each other respectfully, and act ethically and in accordance with the code towards their clients and the public.
The Data Science Association has a fledgling code of conduct, but unlike CAs, membership is not contingent on adhering to this code, and there are no penalties for non-compliance (Data Science Association n.d.).
There is another reason the comparison with the CA profession is interesting.
Like accounting, DS is all about numbers, and seems like a quantitative and objective science. Yet there is compelling research to indicate both are more like social sciences, and benefit from being reflexive in their research practices (boyd & Crawford 2012; Elish & boyd 2017; Chua 1986, 1988; Gaffikin 2011). Also like accountants (Gallhofer, Haslam & Yonekura 2013), DS practitioners could suffer criticism for being long on practice and short on theory.
Therefore, DS practitioners should look hard at the experience of accountants and determine if, and when, becoming a profession might work for them.
DS practitioners’ ethics should address three areas:
“Data ethics can be defined as the branch of ethics that studies and evaluates moral problems related to data (including generation, recording, curation, processing, dissemination, sharing and use), algorithms (including artificial intelligence, artificial agents, machine learning and robots) and corresponding practices (including responsible innovation, programming, hacking and professional codes), in order to formulate and support morally good solutions (e.g. right conducts or right values).” (Floridi & Taddeo 2016)
It is conceivable that individually, DS practitioners could be ethical in their conduct, without the large cost in time and money of professional membership.
Data scientists are very open about their techniques, code and results accuracy, and welcome suggestions and feedback. They use open source software packages, share their code on sites like GitHub and BitBucket, contribute answers on Stack Overflow, blog about their learnings, and present at and attend meetups. It’s all very collegial, and competitive forces drive continuous improvement.
But despite all this online activity, it is not clear whether they behave ethically. They do not readily share data, as it is often proprietary and confidential, nor do they share the substantive results and interpretation. This makes it difficult to peer review or reproduce their results, and their DS practices are not transparent enough for anyone to ascertain whether they are ethical or not.
A professional body may seem like a lot of obligations and rules, but it could provide data scientists with some protection and more access to data.
From the public’s point of view, a profession is meant to be an indicator of trust and expertise (Professional Standards Councils n.d.). Unlike other professions, the public would rarely directly employ the services of a data scientist, but they do give consent for data scientists to collect their data (“oil”).
Forming a professional body and adopting a code of professional conduct is one way to earn public trust and the right to access and handle personal data (Accenture n.d.). It can also help pool resources (and facilitate self-employment), so it may open more doors to data scientists, and allow them to pursue initiatives that are altruistic and socially preferable (Floridi & Taddeo 2016).
Keeping ethics at the forefront of decision making actually makes for good leaders who can navigate conflict and ambiguity (Accenture n.d.), and results in good financial results (Kiel 2015).
With the growing regulatory focus on data and data security, it is foreseeable that CDOs and CISOs may soon be subject to individual fines and jail time, as Chief Executive and Chief Financial Officers are with regard to Sarbanes–Oxley Act compliance (Wikipedia 2017c). Professional membership can provide the training and support needed to keep practitioners up to date, in compliance and out of jail.
Lastly, right now, the demand for DS skills far outweighs supply. Therefore, despite the significant concentration in DS employers, the bargaining power of some individual data scientists is relatively high. However, they have no real influence over how their work is used: their only option in a disagreement is to resign. Over the medium term, supply will catch up with demand, and then even the threat of resignation will become worthless.
Steering the course of DS practice towards ethical outcomes is easiest at the outset (Winner 1980).
Accenture n.d., ‘Data Ethics Point of view’, http://www.accenture.com, viewed 12 November 2017, <https://www.accenture.com/t00010101T000000Z__w__/au-en/_acnmedia/PDF-22/Accenture-Data-Ethics-POV-WEB.pdf#zoom=50>.
Ally Batley 2017, Air Crash Investigation – DHL Mid Air COLLISION – Crash in Überlingen, viewed 20 November 2017, <https://www.youtube.com/watch?v=yQ0yBFoO2V4>.
Australian Medical Association 2006, ‘AMA Code of Ethics – 2004. Editorially Revised 2006’, Australian Medical Association, viewed 20 November 2017, <https://ama.com.au/tas/ama-code-ethics-2004-editorially-revised-2006>.
Beer, D. 2017, ‘The social power of algorithms’, Information, Communication & Society, vol. 20, no. 1, pp. 1–13.
boyd, danah & Crawford, K. 2012, ‘Critical Questions for Big Data’, Information, Communication & Society, vol. 15, no. 5, pp. 662–79.
CAANZ n.d., ‘Codes and Standards | Member Obligations’, CAANZ, viewed 20 November 2017, <http://www.charteredaccountantsanz.com/member-services/member-obligations/codes-and-standards>.
Chua, W.F. 1988, ‘Interpretive Sociology and Management Accounting Research- a critical review’, Accounting, Auditing and Accountability Journal, vol. 1, no. 2, pp. 59–79.
Chua, W.F. 1986, ‘Radical Developments in Accounting Thought’, The Accounting Review, vol. LXI, no. 4, pp. 601–33.
Codd, A. 2017, ‘How many Chartered accountants are in the world?’, quora.com, viewed 20 November 2017, <https://www.quora.com/How-many-Chartered-accountants-are-in-the-world>.
Columbus, L. n.d., ‘IBM Predicts Demand For Data Scientists Will Soar 28% By 2020’, Forbes, viewed 20 November 2017, <https://www.forbes.com/sites/louiscolumbus/2017/05/13/ibm-predicts-demand-for-data-scientists-will-soar-28-by-2020/>.
Data Science Association n.d., ‘Data Science Association Code of Conduct’, Data Science Association, viewed 13 November 2017, </code-of-conduct.html>.
Elish, M.C. & boyd, danah 2017, Situating Methods in the Magic of Big Data and Artificial Intelligence, SSRN Scholarly Paper, Social Science Research Network, Rochester, NY, viewed 19 November 2017, <https://papers.ssrn.com/abstract=3040201>.
Floridi, L. & Taddeo, M. 2016, ‘What is data ethics?’, Phil. Trans. R. Soc. A, vol. 374, no. 2083, p. 20160360.
Foote, K. 2016, ‘A Brief History of Data Science’, DATAVERSITY, viewed 21 November 2017, <http://www.dataversity.net/brief-history-data-science/>.
Gaffikin, M. 2011, ‘What is (Accounting) history?’, Accounting History, vol. 16, no. 3, pp. 235–51.
Gallhofer, S., Haslam, J. & Yonekura, A. 2013, ‘Further critical reflections on a contribution to the methodological issues debate in accounting’, Critical Perspectives on Accounting, vol. 24, no. 3, pp. 191–206.
Goodman, B. & Flaxman, S. 2016, ‘European Union regulations on algorithmic decision-making and a ‘right to explanation’’, arXiv:1606.08813 [cs, stat], viewed 13 November 2017, <http://arxiv.org/abs/1606.08813>.
IBM Analytics 2017, ‘The Quant Crunch’, IBM, viewed 20 November 2017, <https://www.ibm.com/analytics/us/en/technology/data-science/quant-crunch.html>.
Kiel, F. 2015, ‘Measuring the Return on Character’, Harvard Business Review, viewed 13 November 2017, <https://hbr.org/2015/04/measuring-the-return-on-character>.
Leonelli, S. 2016, ‘Locating ethics in data science: responsibility and accountability in global and distributed knowledge production systems’, Phil. Trans. R. Soc. A, vol. 374, no. 2083, p. 20160122.
Merity, S. 2016, ‘It’s ML, not magic: machine learning can be prejudiced’, Smerity.com, viewed 19 November 2017, <https://smerity.com/articles/2016/algorithms_can_be_prejudiced.html>.
Professional Standards Councils n.d., What is a profession? | Professional Standards Councils, viewed 19 November 2017, <https://www.psc.gov.au/what-is-a-profession>.
Professions Australia n.d., What is a profession?, viewed 21 November 2017, <http://www.professions.com.au/about-us/what-is-a-professional>.
Sutskever, I., Vinyals, O. & Le, Q.V. 2014, ‘Sequence to Sequence Learning with Neural Networks’, arXiv:1409.3215 [cs], viewed 4 November 2017, <http://arxiv.org/abs/1409.3215>.
The Academy of design professionals 2012, ‘The Academy of Design Professionals – Code of Professional Conduct’, designproacademy.org, viewed 13 November 2017, <http://designproacademy.org/code-of-professional-conduct.html>.
the Economist 2017, ‘The world’s most valuable resource is no longer oil, but data’, The Economist, 6 May, viewed 19 November 2017, <https://www.economist.com/news/leaders/21721656-data-economy-demands-new-approach-antitrust-rules-worlds-most-valuable-resource>.
Theresa Anderson n.d., Managing the Unimaginable, viewed 19 November 2017, <https://www.youtube.com/watch?v=YEPPW09qpfQ&feature=youtu.be>.
Wikipedia 2017a, ‘Chartered accountant’, Wikipedia, viewed 21 November 2017, <https://en.wikipedia.org/w/index.php?title=Chartered_accountant&oldid=810642744>.
Wikipedia 2017b, ‘History of accounting’, Wikipedia, viewed 21 November 2017, <https://en.wikipedia.org/w/index.php?title=History_of_accounting&oldid=810643659>.
Wikipedia 2017c, ‘Sarbanes–Oxley Act’, Wikipedia, viewed 21 November 2017, <https://en.wikipedia.org/w/index.php?title=Sarbanes%E2%80%93Oxley_Act&oldid=808445664>.
Wikipedia n.d., Überlingen mid-air collision – Wikipedia, viewed 20 November 2017, <https://en.wikipedia.org/wiki/%C3%9Cberlingen_mid-air_collision>.
Winner, L. 1980, ‘Do Artifacts Have Politics?’, Daedalus, vol. 109, no. 1, pp. 121–36.
Winner, L. 1993, ‘Upon Opening the Black Box and Finding It Empty: Social Constructivism and the Philosophy of Technology’, Science, Technology, & Human Values, vol. 18, no. 3, pp. 362–78.