A Human’s guide to testing AI
What does it mean to test AI? The spam detection feature in your email mostly works, but sometimes doesn’t. Even if you flag items as ‘not spam’, some of them reappear in the spam folder after a while. Would you categorize that as a machine learning FAIL?
Amazon recommends books related to your purchase. Some of them make sense. Others don’t. Hopefully the algorithm learns and gets better over time. Are the incorrect recommendations a failure of the algorithm? Isn’t it expected that recommendations will learn and improve over time? So, are the failures in fact not failures?
What does it really mean to test AI?
Do you test the way the algorithm ‘learns’ and improves? You could test with real world data. Are you testing the math? Can you test the math?
What is ‘AI’?
Providing a concise definition of AI (artificial intelligence) can be excruciating. The quote below, from ‘A Human’s Guide to Machine Intelligence’, is good enough context for this article.
AI involves enabling computers to do all the things that typically require human intelligence, including reasoning, understanding language, navigating the visual world, and manipulating objects. Machine learning is a subfield of AI that gives machines the ability to learn (progressively improve their performance on a specific task) from experience, the aptitude that underlies all other aspects of intelligence. If a robot is as good as humans at a variety of tasks but is unable to learn, it will soon fall behind. For that reason machine learning is, arguably, one of the most important aspects of AI.
In the rest of this article, I use the term ‘AI software’. For the most part, the problems I describe are approached using machine intelligence: algorithms that use historical data to recognize patterns. For example, emails with text promising lottery wins are probably unsolicited spam.
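As a toy illustration of pattern-based spam scoring, consider the sketch below. The keywords, weights, and threshold are invented for demonstration; a real filter learns these from historical data rather than hard-coding them.

```python
# Toy sketch: a keyword-based spam scorer. The weights below are
# invented; real filters learn such weights from labeled email data.
SPAM_WEIGHTS = {
    "lottery": 0.6,
    "winner": 0.3,
    "free": 0.2,
    "claim": 0.3,
}

def spam_score(text: str) -> float:
    """Sum the weights of known spam keywords found in the text."""
    return sum(SPAM_WEIGHTS.get(w, 0.0) for w in text.lower().split())

def is_spam(text: str, threshold: float = 0.5) -> bool:
    """Flag a message as spam if its score crosses the threshold."""
    return spam_score(text) >= threshold

print(is_spam("You are a lottery winner claim your prize"))  # True
print(is_spam("Meeting agenda for Monday"))                  # False
```

Even this toy shows why the behavior feels inconsistent to users: a legitimate email that happens to mention a ‘free lottery’ promotion will be flagged, while spam that avoids the known keywords slips through.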
Validate your hypothesis
Algorithms and the choices developers make have real world consequences. What is the purpose of a recommendation engine on Amazon? When you make purchases, recommendation engines will display related items. To test a recommendation engine, you may validate the relevance of the recommended items. That in itself is a very challenging problem. However, recommendation engines have a broader role.
One promise of recommendation engines is that obscure titles will become popular. As a tester, how will you validate that hypothesis? You can use a control group which isn’t exposed to recommendations, then compare the actual users with the control group. Do users purchase more of the recommended items (compared to the control group)?
Can you consider other factors? How does purchasing behavior change across all products? Can you look at market share as well as absolute sales? Do recommendations get users to purchase new products? Or is it the same users purchasing the same types of products? How will recommendation engines work when books don’t have sufficient purchasing history? There are many more questions you can ask which can lead to further experiments.
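To make the control-group idea concrete, here is a minimal sketch in Python. The item names, group assignments, and purchase lists are all invented for illustration; a real experiment would use logged purchase data and a proper statistical test.

```python
# Hypothetical sketch: compare how often recommended items are purchased
# by users who saw recommendations (treatment) vs. those who did not
# (control). All data below is invented.

def purchase_rate(purchases, recommended):
    """Fraction of a user's purchases that were recommended items."""
    if not purchases:
        return 0.0
    return sum(1 for item in purchases if item in recommended) / len(purchases)

def group_rate(group, recommended):
    """Average per-user purchase rate of recommended items."""
    return sum(purchase_rate(p, recommended) for p in group) / len(group)

recommended = {"book_a", "book_b", "book_c"}

# Each inner list is one user's purchases.
treatment_group = [["book_a", "book_x"], ["book_b"], ["book_y"]]
control_group = [["book_x"], ["book_a", "book_y"], ["book_z"]]

lift = group_rate(treatment_group, recommended) - group_rate(control_group, recommended)
print(f"Treatment: {group_rate(treatment_group, recommended):.2f}, "
      f"Control: {group_rate(control_group, recommended):.2f}, "
      f"Lift: {lift:.2f}")
```

Note that the control group still purchases some recommended items on its own; the question is whether the treatment group purchases noticeably more, and whether that lift holds across products rather than for a few already-popular titles.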
Algorithm impact on user behavior
Algorithms can impact how people behave in the real world. That seems obvious with social media apps. How do you measure the influence of algorithms on social media apps?
You could tweak an algorithm to show a different set of choices, then measure user behavior. In one research study, Facebook users were shown more hard-hitting news; those users had a higher voter turnout.
You could alter the recommendations shown to users while keeping a control group, then measure user behavior. In another study, Facebook users were shown either more positive or more negative posts. Users’ subsequent posts reflected the emotion of their news feed.
A Human’s guide to Machine Intelligence
In the last few years, AI FAILS have been a novelty mixed with indignation. It’s easy to find examples of failures in the news. In contrast, the book ‘A Human’s Guide to Machine Intelligence’ is a very readable account of how algorithms fail. In this blog post, I’ve posed some of the questions from the book that you might ask if you were evaluating algorithms as a software tester. The book has detailed analysis and explanations of how algorithms fail, and the background behind some of these questions. Its focus is not the math, but how algorithm design choices can cause problems in the real world.
As a software tester, you may be better served by strategies for testing AI, or by personal accounts of how others have tested AI algorithms.
What is testing AI?
Testing starts with an observation or an idea. You then validate your idea by conducting an experiment.
- Facebook researchers have a hypothesis that social media posts influence user behavior.
- A software developer notices that businesses are not getting good SEO rankings.
- A user complains that their email client suddenly flags legitimate messages as spam.
These observations are followed by an experiment.
The approach to testing AI is probably no different from testing other software. When testing a calculator, I wonder whether it assumes numbers are entered in a particular format. Does it accept ‘.03’ without the leading zero (0)? If it doesn’t, does it matter? I can test out different options to infer what the calculator does.
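The same probing can be expressed as a quick script. The calculator’s parser is hypothetical here; Python’s built-in float() stands in for it just to make the sketch runnable.

```python
# Sketch: probe a calculator's number parser with different input
# formats to infer its assumptions. parse_number() is a stand-in for
# the (hypothetical) parser under test.

def parse_number(text: str) -> float:
    return float(text)  # stand-in; a real test would call the calculator

# Formats a user might plausibly enter.
cases = [".03", "0.03", "3.", "03"]
for case in cases:
    try:
        print(f"{case!r} -> {parse_number(case)}")
    except ValueError:
        print(f"{case!r} -> rejected")
```

Running the probes tells me what the parser accepts and rejects; the follow-up question, as with any test, is whether a rejection actually matters to users.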
This does not mean testing AI is straightforward or even possible in your role as a tester. Testers or developers may not be part of the group conducting tests with users. On the other hand, in smaller teams any team member may be able to ask questions. They may be part of the group which reviews experiments with users.
When working with complex algorithms, especially those that use real world data, you need to think about side effects or unanticipated consequences.
In a now well-known problem, a search engine’s auto-complete prompts users with common prejudices. Entering ‘women should’ may display suggestions such as ‘women should stay at home’. The unintended consequence is that auto-complete may not only offer harmful suggestions, but direct users to sites which carry those harmful messages. A seemingly harmless enhancement such as auto-complete on a search engine can influence people’s attitudes and can impact society as a whole. (As an aside, how would you design systems to address this issue?)
When designing auto-complete or similar systems, a bigger challenge is how do you differentiate between concepts, such as porn and sexuality? Does your algorithm understand the difference between concepts, or are they just words?
On some social media sites, you are alerted about offensive language. How do you handle names or locations which may include offensive words? One way to handle the issue is to ignore proper nouns when alerting users — which itself may be a challenge. If you do allow proper nouns, how do you handle attempts to misuse the system?
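Here is a minimal sketch of the proper-noun idea. The word list is an invented placeholder and the proper-noun check is a crude capitalization heuristic; a real system would need actual named-entity recognition.

```python
# Minimal sketch: flag offensive words but skip likely proper nouns.
# The word list is a placeholder and the heuristic (capitalized word
# that is not sentence-initial) is deliberately crude.

OFFENSIVE = {"crap"}  # invented placeholder list

def flag_offensive(text: str) -> list:
    """Return offensive words in the text, ignoring likely proper nouns."""
    flagged = []
    for i, word in enumerate(text.split()):
        is_proper_noun = i > 0 and word[0].isupper()
        if word.lower() in OFFENSIVE and not is_proper_noun:
            flagged.append(word)
    return flagged

print(flag_offensive("this app is crap"))       # ['crap']
print(flag_offensive("welcome to Crap Creek"))  # [] - treated as a place name
print(flag_offensive("This is Crap"))           # [] - capitalization evades the filter
```

The last example is exactly the misuse question raised above: once proper nouns are exempt, users can evade the filter simply by capitalizing a word mid-sentence.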
Social media sites like Facebook and LinkedIn create trending topics and feeds. How do you handle the presence of ‘fake news’ in the feed? Do you question the credibility of news sources? Do you question whether someone can tamper with your data?
To be fair, many of these questions may not be in the purview of the development team or of software testers. However, this post should give ideas about the questions that you could ask if you can influence decisions.
Real World Data
Problems solved using AI often use a large amount of real-world data. Real-world data will have its quirks which are difficult to anticipate in the lab. You can only go so far in simulating a Facebook feed (which does not imply that you do nothing, or that there aren’t powerful alternatives).
Problems solved using AI often use social data — information related to people’s lives. Facebook and similar apps use friend details and activity and user interaction, information related to social groups. Other systems impact business, such as automated trading, or a social feed on a financial website, book recommendations or search engine rankings. Advertising systems affect consumer behavior.
In the case of auto-complete, in a search engine, you need to handle loaded topics, like race, gender, religion. You also need to consider people wanting to mislead gullible users. Image recognition is not only about pixels, but about people and their gender, race and location.
Not being able to use real world data is a major challenge for testing AI software.
The problem with testing AI
Some of the most important insights about testing AI from the book are:
But we generally don’t evaluate, let alone measure, algorithms holistically, considering all the ways in which they change their users’ decisions, individually or in aggregate.
Most other algorithms continue to be evaluated narrowly using one or two technical criteria and without regard to the kinds of metrics that keep social scientists awake at night: issues such as fairness, accountability, safety, and privacy. Unintended consequences can arise even when every step of an algorithm is carefully designed by a programmer.
These are my insights, adding an understanding of testing:
- The start of a test is the question we ask. A question may come up from exploration, experiments or from a problem faced by the user.
- In general terms, you will need to be creative, as opposed to following a process, to ask the right questions.
- That you ask the question is much more important than who asks it or when you ask.
- Asking the question is more important than what led you to ask the question — whether it is the use of tools or a thought experiment.
- The examples I have described in this article are definitive, i.e., there is a clear hypothesis followed by an experiment. The actual process of testing (and thinking) is much more fluid and situational. The results of experiments lead to more questions and more experiments. You may also investigate multiple questions. You may use the software, like a user would, to discover more problems. The overarching purpose of testing is to keep learning about the software in order to find unknown risks.
For an article like this, there is no big reveal. I didn’t focus on specific techniques, and I didn’t promise a theory which explains everything. I won’t smugly claim that testers or developers can test AI-based software merely by asking questions or being inquisitive. Good testing will require preparation along with aptitude. Finding problems will also require a strong understanding of AI and the related algorithms.
The problem with testing AI is not mistakes or oversights. The challenge is unintended consequences. The problem is with algorithm design choices and their impact on the real world and users. The broader problem with testing AI is not recognizing that we need to ask questions and keep learning. It’s the same problem with testing any other software. Just as with software in general, it may not be apparent that there is a problem to be solved.
1. A Human’s Guide to Machine Intelligence, Kartik Hosanagar, https://www.amazon.com.au/Humans-Guide-Machine-Intelligence-Algorithms/dp/0525560882
2. Kartik Hosanagar is a professor of Marketing at The Wharton School of the University of Pennsylvania. He lists his research interests as internet advertising, e-commerce, and digital media. Links: Personal website, Twitter.
3. If you are not a tester, this is a great article to understand testing: https://www.linkedin.com/pulse/how-cem-kaner-testssoftware-nilanjan-bhattacharya/ (All credit goes to the original author — Cem Kaner.)
4. If you want to understand testing, without reading much, you can follow Michael Bolton on Twitter: https://twitter.com/michaelbolton/ He tweets frequently and more importantly, many of his tweets are profound. For more, you can read his blog — www.developsense.com/blog. Another thought leader in testing is James Bach — www.satisfice.com/blog. I recommend the Kindle version of the classic and easy to read book on testing: https://www.amazon.com/Lessons-Learned-Software-Testing-Context-Driven-ebook/dp/B07MGKRDBY
Originally published at https://www.linkedin.com.