From UI automation agents to Screen understanding

Revisiting the Accessibility-UI automation framework.

UINav

Google Research publication - UINav: A Practical Approach to Train On-Device Automation Agents

A demonstration-based approach to train automation agents that fit on mobile devices, while achieving high success rates with a modest number of demonstrations.

What are UI automation agents

Agents that can execute human tasks by interacting directly with the UI of a running application. (Liu et al., 2020; Humphreys et al., 2022)

Existing approaches to UI automation

UI scripts

Enterprises commonly use UI scripts to automate business workflows. (UIPath, 2023)

AI-based agents

Imitation learning and reinforcement learning have been tried, but the synthetic test environments they are evaluated in are simple, and they require a large number of demonstrations.

Transformers and pre-trained large language models (LLMs) have been tried, but they are resource-intensive.

Proposed UINav

UINav strikes a trade-off among accuracy, generalizability, and computational cost.

Demonstrations

‘Demonstration-based’ here means few-shot: the agent is trained from a small number of human demonstrations.

During demonstration collection, a referee model gives immediate feedback on failing tasks and asks for additional demonstrations when needed. The referee model is trained to predict whether a task has been successfully completed.
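
A minimal sketch of what such a referee model could look like, assuming it consumes the same per-element screen representation and utterance embedding as the agent; the input shapes and layer sizes below are illustrative assumptions, not taken from the paper.

```python
import tensorflow as tf

# Minimal sketch of a referee model: a binary classifier that scores whether
# the current screen indicates the task described by the utterance is complete.
# Input shapes and layer sizes are illustrative assumptions, not from the paper.
def build_referee(num_elements=32, element_dim=64, utterance_dim=128):
    elements = tf.keras.Input(shape=(num_elements, element_dim), name="ui_elements")
    utterance = tf.keras.Input(shape=(utterance_dim,), name="utterance_embedding")

    # Pool the per-element features into a single screen vector.
    screen = tf.keras.layers.GlobalAveragePooling1D()(elements)
    joint = tf.keras.layers.Concatenate()([screen, utterance])
    hidden = tf.keras.layers.Dense(128, activation="relu")(joint)
    # Probability that the task has been successfully completed.
    success = tf.keras.layers.Dense(1, activation="sigmoid", name="task_success")(hidden)
    return tf.keras.Model([elements, utterance], success)

referee = build_referee()
referee.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```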

How to deal with system delays and dynamic UI

  1. Macro actions (see the sketch after this list). A macro action composes a sequence of lower-level operations into a small program. This shrinks the agent's state space (and hence the number of demonstrations needed) and speeds up execution.
  2. Demonstration augmentation. Human demonstrations are augmented by randomizing non-critical UI elements to increase their diversity.
  3. Utterance masking. Variable sub-strings in utterances are abstracted out.
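
A hypothetical sketch of a macro action as a small program of lower-level operations. The driver object and its operation names (tap, clear_text, type_text, wait_until_idle) are assumptions made for illustration, not UINav's actual API.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    element_id: str
    bounds: tuple  # (left, top, right, bottom)

def macro_type_into_field(driver, field: UIElement, text: str):
    """Compose focus + clear + type + wait into a single macro action.

    `driver` is a hypothetical handle to the UI automation backend.
    """
    driver.tap(field.bounds)      # bring the field into focus
    driver.clear_text(field)      # remove any existing content
    driver.type_text(text)        # enter the new text
    driver.wait_until_idle()      # absorb system delays before the next step
```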

Agent’s neural network architecture

The screen representation comes from two sources: a) descriptions of element attributes [type, text, on-screen position, utterance matching, state]; b) representations generated from raw pixels using screen understanding techniques (Chen et al., 2020; Wu et al., 2021; Zhang et al., 2021) [icon detection, text recognition, a tree-structured representation of the UI (e.g., the Android accessibility tree)].

Input 1: a set of UI elements. Input 2: an utterance.

Output predicted by the decoder: which action to perform [ready-to-act element, action type, action arguments]. Actions come in two kinds: a) element actions; b) global actions.
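
A rough sketch of this input/output interface, assuming fixed-size per-element feature vectors and a pre-computed utterance embedding. The dimensions, fusion scheme, and head layout are illustrative assumptions, not the exact UINav architecture; the action-arguments head is omitted for brevity.

```python
import tensorflow as tf

# Illustrative sizes (assumptions, not from the paper).
NUM_ELEMENTS, ELEMENT_DIM, UTTERANCE_DIM, NUM_ACTION_TYPES = 32, 64, 128, 8

elements = tf.keras.Input(shape=(NUM_ELEMENTS, ELEMENT_DIM), name="ui_elements")
utterance = tf.keras.Input(shape=(UTTERANCE_DIM,), name="utterance_embedding")

# Condition each UI element on the utterance, then encode it.
tiled = tf.keras.layers.RepeatVector(NUM_ELEMENTS)(utterance)
fused = tf.keras.layers.Concatenate()([elements, tiled])
encoded = tf.keras.layers.Dense(128, activation="relu")(fused)

# Head 1: which element to act on (softmax over the element set).
element_logits = tf.keras.layers.Dense(1)(encoded)
element_probs = tf.keras.layers.Softmax(axis=1)(tf.keras.layers.Flatten()(element_logits))

# Head 2: which action type to perform (element actions vs. global actions).
pooled = tf.keras.layers.GlobalAveragePooling1D()(encoded)
action_type = tf.keras.layers.Dense(NUM_ACTION_TYPES, activation="softmax")(pooled)

agent = tf.keras.Model([elements, utterance], [element_probs, action_type])
```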

Utterance masking

{Search for tiktok in Google} → replace the specific instance before feeding it to the agent: {Search for placeholder in Google}, while ‘tiktok’ is saved for later use when performing the task.
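
A small sketch of this masking step; the placeholder string and return format are assumptions made for illustration.

```python
# Sketch of utterance masking: the variable sub-string is replaced with a
# placeholder before the utterance reaches the agent, and the original value
# is kept aside for use when the action is actually performed.
def mask_utterance(utterance: str, variable: str, placeholder: str = "placeholder"):
    masked = utterance.replace(variable, placeholder)
    return masked, {placeholder: variable}

masked, slots = mask_utterance("Search for tiktok in Google", "tiktok")
# masked -> "Search for placeholder in Google"
# slots  -> {"placeholder": "tiktok"}  (restored when typing into the search box)
```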

Demonstration augmentation

Limit the number of required demonstrations for efficiency. Make the agent more tolerant to UI changes by perturbing non-critical UI elements: either i) replacing the embedding of their text labels with random vectors, or ii) adding random offsets to the four scalars of their bounding boxes (i.e., randomizing each element's position and size).
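
A sketch of these two perturbations; the element dictionary layout and the jitter magnitude are illustrative assumptions.

```python
import numpy as np

# Sketch of demonstration augmentation on non-critical UI elements:
# (i) replace the text-label embedding with a random vector, and
# (ii) jitter the four bounding-box scalars to randomize position and size.
def augment_element(element, rng, jitter=0.05):
    augmented = dict(element)
    if not element["critical"]:
        augmented["text_embedding"] = rng.normal(size=element["text_embedding"].shape)
        augmented["bbox"] = element["bbox"] + rng.uniform(-jitter, jitter, size=4)
    return augmented

rng = np.random.default_rng(0)
element = {"critical": False,
           "text_embedding": np.zeros(128),
           "bbox": np.array([0.1, 0.2, 0.4, 0.3])}  # normalized (left, top, right, bottom)
augmented = augment_element(element, rng)
```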

Implementation

Built on the Android platform.

Datasets: MoTIF and an in-house dataset.

The agent and referee models are implemented in TensorFlow. Both are first trained off-device until stable (using the TensorFlow Python API), then converted to .tflite (TensorFlow Lite) and deployed on device. Both use SmallBERT L-2_H-128_A-2 (the smallest variant). No quantization is applied during the .tflite conversion.
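
A sketch of this off-device-to-on-device flow: export a trained SavedModel and convert it to .tflite without quantization (no converter optimizations enabled). The paths are placeholders.

```python
import tensorflow as tf

# Convert an exported SavedModel to TensorFlow Lite. Leaving
# converter.optimizations unset keeps weights in float32 (no quantization),
# matching the description above.
converter = tf.lite.TFLiteConverter.from_saved_model("export/agent_saved_model")
tflite_model = converter.convert()

with open("agent.tflite", "wb") as f:
    f.write(tflite_model)
```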

UINav relies on an in-house companion Android app to extract screen representations and to perform macro actions.

Off-device mode: AndroidEnv (Toyama et al., 2021) is used to communicate between the companion app and the learning environment (the models). On-device mode: the models interact with the companion app directly. Screen representations and macro actions are designed to be model-agnostic.

Macro actions

2021 Learning UI Navigation through demonstrations composed of macro actions.

IconNet

Google AI blog post: Improving Mobile App Accessibility with Icon Detection

A vision-based object detection model that automatically detects icons on the screen, agnostic to the underlying structure of the app (i.e., it does not rely on app-level accessibility trees); launched as part of Voice Access.

Screen Representation

Chen et al., 2020

Object detection for graphical user interface: old fashioned or deep learning or a combination? Keywords: Android, object detection, user interface, deep learning, computer vision

Wu et al., 2021

Screen Parsing: Towards Reverse Engineering of UI Models from Screenshots Keywords: user interface modeling, ui semantics, hierarchy prediction

Model architecture

A standard object detection model extracts the UI elements in a screen together with their parameters, specifically a Faster R-CNN with a ResNet-50 backbone. A top-down transition-based parser then constructs and decodes the UI hierarchy.
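
A minimal sketch of turning flat detections into a hierarchy by containment. This is a simple stand-in to illustrate hierarchy construction, not a reimplementation of the paper's learned transition-based parser; the box format (left, top, right, bottom) is an assumption.

```python
def area(box):
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def build_hierarchy(detections):
    """detections: list of dicts with a 'bbox' key; returns a list of root nodes."""
    nodes = [{"det": d, "children": []}
             for d in sorted(detections, key=lambda d: -area(d["bbox"]))]
    roots = []
    for i, node in enumerate(nodes):
        # Candidates are ordered from largest to smallest, so the last match
        # is the smallest element that fully contains this one.
        parent = None
        for candidate in nodes[:i]:
            if contains(candidate["det"]["bbox"], node["det"]["bbox"]):
                parent = candidate
        (parent["children"] if parent else roots).append(node)
    return roots
```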

Zhang et al., 2021

Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels. CCS: Human-centered computing → Accessibility technologies. Keywords: mobile accessibility, accessibility enhancement, ui detection

Model architecture

Detector candidates: Faster R-CNN, Mask R-CNN, the TuriCreate object detection toolkit, Single Shot MultiBox Detector (SSD).

Backbone: MobileNetV1. Feature extractor: Feature Pyramid Network (FPN).

Data augmentation on under-represented UI types. A class-balanced loss function is applied to give more weight to under-represented UI types.

Training: 4 Tesla V100 GPUs for 20 hours (557k iterations).

Post-processing to filter out duplicate detections: i) non-max suppression; ii) a different confidence threshold applied to each UI type.
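
A sketch of this post-processing step using TensorFlow's non-max suppression followed by a per-UI-type confidence threshold. The PER_TYPE_THRESHOLD values and type names are illustrative assumptions, not the thresholds used in the paper.

```python
import tensorflow as tf

# Illustrative per-UI-type confidence thresholds (assumed values).
PER_TYPE_THRESHOLD = {"button": 0.5, "icon": 0.4, "text_field": 0.6}

def filter_detections(boxes, scores, ui_type, iou_threshold=0.5, max_out=100):
    """boxes: [N, 4] float32 (ymin, xmin, ymax, xmax); scores: [N] float32."""
    # Drop duplicate detections of the same element.
    keep = tf.image.non_max_suppression(boxes, scores, max_out, iou_threshold)
    boxes, scores = tf.gather(boxes, keep), tf.gather(scores, keep)
    # Apply the confidence threshold specific to this UI type.
    mask = scores >= PER_TYPE_THRESHOLD.get(ui_type, 0.5)
    return tf.boolean_mask(boxes, mask), tf.boolean_mask(scores, mask)
```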

Improvement on raw UI detection results

A) Find missing UI elements and remove extra detections.

B) Recognize rich UI content in Text, Icon, Picture.

C) Determine UI selection state.

D) Determine UI interactivity, eg swipe up or down for a page control, double tap to edit.

E) Group elements for efficient navigation.

F) Infer navigation order.
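
As one concrete illustration of item F, a plain heuristic is to group elements into rows by vertical position and then read rows top-to-bottom and each row left-to-right. This sketch is an assumption about how such a heuristic could look, not the paper's method; the row_tolerance parameter is made up for illustration.

```python
def navigation_order(elements, row_tolerance=10):
    """elements: list of dicts with 'bbox' = (left, top, right, bottom) in pixels."""
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    rows, current = [], []
    for el in ordered:
        # Start a new row when the vertical gap from the previous element is large.
        if current and abs(el["bbox"][1] - current[-1]["bbox"][1]) > row_tolerance:
            rows.append(sorted(current, key=lambda e: e["bbox"][0]))
            current = []
        current.append(el)
    if current:
        rows.append(sorted(current, key=lambda e: e["bbox"][0]))
    # Flatten rows into a single top-to-bottom, left-to-right order.
    return [el for row in rows for el in row]
```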

Future work

Improve the heuristics. Apply to Android, and also to desktop and the Web. Learn structured information from the app view hierarchy in addition to pixel information (with on-the-fly, out-of-accessibility-mode potential). Incorporate the model into developer tools to help developers. Serve it through accessibility APIs. Build a more universal model, resilient to changes in aesthetics and interactions.

Datasets

Chen et al., 2018

From UI Design Image to GUI Skeleton: A Neural Machine Translator to Bootstrap Mobile GUI Implementation.

An automated Android UI data collector built on Stoat collects 185,277 pairs of [UI image, GUI skeleton] from 5,042 Android apps.

Rico

2017 Rico: A mobile app dataset for building data-driven design applications. Website link.

UI corpus with 72K Android UI screens mined from 9.7K Android apps. Each screen has [a screenshot image, a view hierarchy of a collection of UI objects, a set of properties of the UI objects (name, type, bounding box position)].

Used to train image classification models to recognize icons.
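
A sketch of the per-screen structure described above. The file naming and JSON layout are assumptions about the Rico release, shown only for illustration.

```python
import json
from dataclasses import dataclass

@dataclass
class RicoScreen:
    screenshot_path: str   # screenshot image
    view_hierarchy: dict   # tree of UI objects
    ui_objects: list       # flattened (name, type, bounding box) properties

def load_screen(screen_id: str, root: str = "rico") -> RicoScreen:
    # Assumed layout: one JSON view hierarchy and one JPG screenshot per screen id.
    with open(f"{root}/{screen_id}.json") as f:
        hierarchy = json.load(f)
    return RicoScreen(
        screenshot_path=f"{root}/{screen_id}.jpg",
        view_hierarchy=hierarchy,
        ui_objects=[],  # fill by walking the hierarchy and collecting object properties
    )
```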

MoTIF

2022 A dataset for interactive vision language navigation with unknown command feasibility. Github link.

A sample includes the natural-language command (i.e., the task) and, for each time step, the app view hierarchy, the app screen image, and the action coordinates.
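
A sketch of one such sample as an episode of time steps; the field names are illustrative assumptions, not the actual MoTIF schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MotifStep:
    view_hierarchy: dict        # app view hierarchy at this time step
    screenshot_path: str        # app screen image at this time step
    action_xy: Tuple[int, int]  # coordinates of the action taken at this step

@dataclass
class MotifEpisode:
    command: str                          # the natural-language task
    steps: List[MotifStep] = field(default_factory=list)
```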

Seq2act

2020 Mapping Natural Language Instructions to Mobile UI Action Sequences. Github link.

PixelHelp

Pixel Phone Help pages, instructions for performing common tasks on Google Pixel phones.

AndroidHowTo

32,436 data points from 9,893 unique How-To instructions.

RicoSCA

Filtered from Rico: 295,476 single-step synthetic commands for operating 177,962 different target objects across 25,677 Android screens.

Object Detection

CenterNet

2019 CenterNet: Keypoint Triplets for Object Detection

Faster R-CNN

2015 Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

YOLOv3

2018 YOLOv3: An Incremental Improvement

Further Readings

Survey on Mobile Accessibility

/ML/ /Accessibility/ /Screen Representation/ /HCI/