
  • Open access
  • Published: 03 July 2017

Users’ design feedback in usability evaluation: a literature review

  • Asbjørn Følstad (ORCID: orcid.org/0000-0003-2763-0996)

Human-centric Computing and Information Sciences, volume 7, Article number: 19 (2017)

Abstract

As part of usability evaluation, users may be invited to offer their reflections on the system being evaluated. Such reflections may concern the system’s suitability for its context of use, usability problem predictions, and design suggestions. We term the data resulting from such reflections users’ design feedback . Gathering users’ design feedback as part of usability evaluation may be seen as controversial, and the current knowledge on users’ design feedback is fragmented. To mitigate this, we have conducted a literature review. The review provides an overview of the benefits and limitations of users’ design feedback in usability evaluations. Following an extensive search process, 31 research papers were identified as relevant and analysed. Users’ design feedback is gathered for a number of distinct purposes: to support budget approaches to usability testing, to expand on interaction data from usability testing, to provide insight into usability problems in users’ everyday context, and to benefit from users’ knowledge and creativity. Evaluation findings based on users’ design feedback can be qualitatively different from, and hence complement, findings based on other types of evaluation data. Furthermore, findings based on users’ design feedback can hold acceptable validity, though the thoroughness of such findings may be questioned. Finally, findings from users’ design feedback may have substantial impact in the downstream development process. Four practical implications are highlighted, and three directions for future research are suggested.

Introduction

Involving users in usability evaluation is valuable when designing information and communication technology (ICT), and a range of usability evaluation methods (UEM) support user involvement. Relevant methods include adaptations of usability testing [ 1 ], usability inspection methods such as pluralistic walkthrough [ 2 ], and inquiry methods such as interviews [ 3 ] and focus groups [ 4 ].

Users involved in usability evaluation may generate two types of data. We term these interaction data and design feedback . Interaction data are recordings of the actual use of an interactive system, such as observational data, system logs, and data from think-aloud protocols. Design feedback are data on users’ reflections concerning an interactive system, such as comments on experiential issues, considerations of the system’s suitability for its context of use, usability problem predictions, and design suggestions.

The value of interaction data in evaluation is unchallenged. Interaction data is held to be a key source of insight into the usability of interactive systems and has been the object of thorough scientific research. Numerous empirical studies concern the identification of usability problems on the basis of observable user behaviour [ 5 ]. Indeed, empirical UEM assessments are typically done by comparing the set of usability problems identified through the assessed UEM with a set of usability problems identified during usability testing (e.g. [ 6 , 7 ]).

The value of users’ design feedback is, however, disputed. Nielsen [ 8 ] stated, as a first rule of usability, “don’t listen to users” and argued that users’ design feedback should be limited to preference data after having used the interactive system in question. Users’ design feedback may be biased due to a desire to report what the evaluator wants to hear, imperfect memory, and rationalization of their own behaviour [ 8 , 9 ]. As discussed by Gould and Lewis [ 10 ], it can be challenging to elicit useful design information from users as they may not have considered alternative approaches or may be ignorant of relevant alternatives; users may simply be unaware of what they need. Furthermore, as discussed by Wilson and Sasse [ 11 ], users do not always know what is good for them and may easily be swayed by contextual factors when making assessments.

Nevertheless, numerous UEMs that involve the gathering and analysis of users’ design feedback have been suggested (e.g. [ 12 – 14 ]), and textbooks on usability evaluations typically recommend gathering data on users’ experiences or considerations in qualitative post-task or post-test interviews [ 1 , 15 ]. It is also common among usability practitioners to ask for the opinion of the participants in usability testing pertaining to usability problems or design suggestions [ 16 ].

Our current knowledge of users’ design feedback is fragmented. Despite the number of UEMs suggested to support the gathering of users’ design feedback, no coherent body of knowledge on users’ design feedback as a distinct data source has been established. Existing empirical studies of users’ design feedback typically involve the assessment of one or a small number of UEMs, and only to a limited degree build on each other. Consequently, a comprehensive overview of existing studies on users’ design feedback is needed to better understand the benefits and limitations of this data source in usability evaluation.

To strengthen our understanding of users’ design feedback in usability evaluation we present a review of the research literature on such design feedback (Footnote 1). Through the review, we have sought to provide an overview of the benefits and limitations of users’ design feedback. In particular, we have investigated users’ design feedback in terms of the purposes for which it is gathered, its qualitative characteristics, its validity and thoroughness, as well as its downstream utility.

Our study is not an attempt to challenge the benefit of interaction data in usability evaluation. Rather, we assume that users’ design feedback may complement other types of evaluation data, such as interaction data or data from inspections with usability experts, thereby strengthening the value of involving users in usability evaluation.

The scope of the study is delimited to qualitative or open-ended design feedback; such data may provide richer insight into the potential benefits and limitations of users’ design feedback than does quantitative or set-response design feedback. Hence, design feedback in the form of data from set-response data gathering methods, such as standard usability questionnaires, is not considered in this review.

Users’ design feedback

In usability evaluation, users may engage in interaction and reflection. During interaction the user engages in behaviour that involves the user interface of an interactive system or its abstraction, such as a mock-up or prototype. The behaviour may include think-aloud verbalization of the immediate perceptions and thoughts that accompany the user’s interaction. The interaction may be recorded through video, system log data, and observation forms or notes. We term such records interaction data. Interaction data is a key data source in usability testing and typically leads to findings formulated as usability problems, or to quantitative summaries such as success rate, time on task, and number of errors [ 1 ].

During reflection, the user engages in analysis and interpretation of the interactive system or the experiences made during system interaction. Unlike the free-flowing thought processes represented in think-aloud data, user reflection typically is conducted after having used the interactive system or in response to a demonstration or presentation of the interactive system. User reflection can be made on the basis of system representations such as prototypes or mock-ups, but also on the basis of pre-prototype documentation such as concept descriptions, and may be recorded as verbal or written reports. We refer to records of user reflection as design feedback, as their purpose in usability evaluation typically is to support the understanding or improvement of the evaluated design. Users’ design feedback often leads to findings formulated as usability problems (e.g. [ 3 , 17 ]), but also to other types of findings such as insight into users’ experiences of a particular design [ 18 ], input to user requirements [ 19 ], and suggestions for changes to the design [ 20 ].

What we refer to as users’ design feedback extends beyond what has been termed user reports [ 9 ], as its scope includes data on users’ reflections not only from inquiry methods but also from usability inspection and usability testing.

UEMs for users’ design feedback

The gathering and analysis of users’ design feedback is found in all the main UEM groups, that is, usability inspection methods, usability testing methods, and inquiry methods [ 21 ].

Usability inspection, though typically conducted by trained usability experts [ 22 ], is acknowledged to be useful also with other inspector types such as “end users with content or task knowledge” [ 23 ]. Specific inspection methods have been developed to involve users as inspectors. In the pluralistic walkthrough [ 2 ] and the participatory heuristic evaluation [ 13 ] users are involved in inspection groups together with usability experts and developers. In the structured expert evaluation method [ 24 ] and the group-based expert walkthrough [ 25 ] users can be involved as the only inspector type.

Several usability testing methods have been developed where interaction data is complemented with users’ design feedback, such as cooperative evaluation, cooperative usability testing, and asynchronous remote usability testing. In the cooperative evaluation [ 14 ] the user is told to think of himself as a co-evaluator and encouraged to ask questions and to be critical. In the cooperative usability testing [ 26 ] the user is invited to review the task solving process upon its completion and to reflect on incidents and potential usability problems. In asynchronous remote usability testing the user may be required to self-report incidents or problems, as a substitute for having these identified on the basis of interaction data [ 27 ].

Inquiry methods typically are general purpose data collection methods that have been adapted to the purpose of usability evaluation. Prominent inquiry methods in usability evaluation are interviews [ 3 ], workshops [ 28 ], contextual inquiries [ 29 ], and focus groups [ 30 ]. Also, online discussion forums have been applied for evaluation purposes [ 17 ]. Inquiry methods used for usability evaluation are generally less researched than usability inspection and usability testing methods [ 21 ].

Motivations for gathering users’ design feedback

There are two key motivations for gathering design feedback from users: users as a source of knowledge and users as a source of creativity.

Knowledge of a system’s context of use is critical in design and evaluation. Such knowledge, which we in the following call domain knowledge, can be a missing evaluation resource [ 22 ]. Users have often been pointed out as a possible source of domain knowledge during evaluation [ 12 , 13 ]. Users’ domain knowledge may be most relevant for usability evaluations in domains requiring high levels of specialization or training, such as health care or gaming. In particular, users’ domain knowledge may be critical in domains where the usability expert cannot be expected to have overlapping knowledge [ 25 ]. Hence, it may be expected that the user reflections that are captured in users’ design feedback are more beneficial for applications specialized to a particular context of use than for applications with a broader target user group.

A second motivation to gather design feedback from users is to tap into their creative potential. This perspective has, in particular, been argued within participatory design. Here, users, developers, and designers are encouraged to exchange knowledge, ideas, and design suggestions in cooperative design and evaluation activities [ 31 ]. In a survey of usability evaluation state-of-the-practice, Følstad, Law, and Hornbæk [ 16 ] found that it is common among usability practitioners to ask participants in usability testing questions concerning redesign suggestions.

How to review studies of users’ design feedback?

Though a wide range of UEMs that involve users’ design feedback have been suggested, current knowledge on users’ design feedback is fragmented; in part, because the literature on relevant UEMs often does not present detailed empirical data on the quality of users’ design feedback (e.g. [ 2 , 13 , 31 ]).

We do not have a sufficient overview of the purposes for which users’ design feedback is gathered. Furthermore, we do not know the degree to which users’ design feedback serves its purpose as usability evaluation data. Does users’ design feedback really complement other evaluation data sources, such as interaction data and usability experts’ findings? To what degree can users’ design feedback be seen as a credible source of usability evaluation findings; that is, what levels of validity and thoroughness can be expected? And to what degree does users’ design feedback have an impact in the downstream development process?

To get an answer to these questions concerning users’ design feedback, we needed to single out that part of the literature which presents empirical data on this topic. We assumed that this literature typically would have the form of UEM assessments, where data on users’ design feedback is compared to some external criterion to investigate its qualitative characteristics, validity and thoroughness, or downstream impact. UEM assessment as a form of scientific enquiry has deep roots in the field of human–computer interaction (HCI), flourishing since the early nineties and typically pitting UEMs against each other to investigate their relative strengths and limitations (e.g. [ 32 , 33 ]). Following Gray and Salzman’s [ 34 ] criticism of early UEM assessments, studies have mainly targeted validity and thoroughness [ 35 ]. However, aspects such as downstream utility [ 36 , 37 ] and the qualitative characteristics of the output of different UEMs (e.g. [ 38 , 39 ]) have also been investigated in UEM assessments.

In our literature review, we have identified and analysed UEM assessments where the evaluation data included in the assessment are, at least in part, users’ design feedback.

Research question

Due to the exploratory character of the study, the following main research question was defined:

What are the potential benefits and limitations of users’ design feedback in usability evaluations?

The main research question was then broken down into four sub-questions, following from the questions raised in the section “ How to review studies of users' design feedback? ”.

RQ1: For which purposes are users’ design feedback gathered in usability evaluation?

RQ2: How do the qualitative characteristics of users’ design feedback compare to that of other evaluation data (that is, interaction data and design feedback from usability experts)?

RQ3: Which levels of validity and thoroughness are to be expected for users’ design feedback?

RQ4: Which levels of downstream impact are to be expected for users’ design feedback?

Methods

The literature review was set up following the guidelines of Kitchenham [ 40 ], with some adaptations to fit the nature of the problem area. In this “Methods” section we describe the search, selection, and analysis process.

Search tool and search terms

Before conducting the review, we were aware of only a small number of studies concerning users’ design feedback in usability evaluation; this in spite of our familiarity with the literature on UEMs. Hence, we decided to conduct the literature search through the Google Scholar search engine to allow for a broader scoping of publication channels than what is supported in other broad academic search engines such as Scopus or Web of Knowledge [ 41 ]. Google Scholar has been criticized for including too broad a range of content in its search results [ 42 ]. However, for the purpose of this review, where we aimed to conduct a broad search across multiple scientific communities, a Google Scholar search was judged to be an adequate approach.

To establish good search terms we went through a phase of trial and error. The key terms of the research question, user and “design feedback”, were not useful even if combined with “usability evaluation”; the former due to its lack of discriminatory ability within the HCI literature, the latter because it is not an established term within the HCI field. Our solution to the challenge of establishing good search terms was to use the names of UEMs that involve users’ design feedback. An initial list of relevant UEMs was established on the basis of our knowledge of the HCI field. Then, whenever we were made aware of other relevant UEMs throughout the review process, these were included as search terms along with the other UEMs. We also included the search term “user reports” (combined with “usability evaluation”) as this term partly overlaps the term design feedback. The search was conducted in December 2012 and January 2013.

Table 1 lists the UEM names forming the basis of the search. For methods or approaches that are also used outside the field of HCI (cooperative evaluation, focus group, interview, contextual inquiry, the ADA approach, and online forums for evaluation) the UEM name was combined with the term usability or “usability evaluation”.
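As an illustration only, the query construction described in the previous paragraph can be sketched in code. This is a hypothetical reconstruction, not tooling from the paper (the search was run manually in Google Scholar), and the method names shown are merely examples of the UEM names in Table 1.

```python
# Hypothetical sketch of the search-string construction described above.
# UEM names that are also used outside HCI are combined with "usability
# evaluation"; the others are used as quoted phrases alone.

GENERAL_PURPOSE_METHODS = {
    "cooperative evaluation", "focus group", "interview",
    "contextual inquiry", "ADA approach", "online forum",
}

def build_queries(uem_names, qualifier='"usability evaluation"'):
    """Return one Google Scholar query string per UEM name."""
    queries = []
    for name in uem_names:
        phrase = f'"{name}"'
        if name in GENERAL_PURPOSE_METHODS:
            phrase = f"{phrase} {qualifier}"
        queries.append(phrase)
    return queries

# Example (illustrative UEM names only):
for query in build_queries(["pluralistic walkthrough", "focus group"]):
    print(query)
```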

To balance the aim for a broad search with the resources available, we set a cut-off at the first 100 hits for each search. For searches that returned fewer hits, we included all. The first 100 hits is, of course, an arbitrary cut-off, and it is possible that more relevant papers would have been found had this limit been extended. Hence, while the search indeed is broad, it cannot claim complete coverage. We do not, however, see this as a problematic limitation. In practice, the cut-off was found to work satisfactorily, as the last part of the included hits for a given search term combination typically returned little of interest for the purposes of the review. Increasing the number of included hits for each search combination would arguably have given diminishing returns.

Selection and analysis

Each of the search result hits was examined according to publication channel and language. Only scientific journal and conference papers were included, as the quality of these is verified through peer review. Also, for practical reasons, only English language publications were included.

All papers were scrutinized with regard to the following inclusion criterion: Include papers with conclusions on the potential benefits and limitations of users’ design feedback. Papers excluded were typically conceptual papers presenting evaluation methods without presenting conclusions, studies on design feedback from participants (often students) who were not also within the target user group of the system, and studies that did not include qualitative design feedback but only quantitative data collection (e.g. set-response questionnaires). In total, 41 papers were retained following this filtering. Included in this set were three papers co-authored by the author of this review [ 19 , 25 , 43 ].

The retained papers were then scrutinized according to possible overlapping studies and errors in classification. Nine papers were excluded because they presented the same data on users’ design feedback as had already been presented in other identified papers, but in less detail. One paper was excluded as it had been erroneously classified as a study of evaluation methods.

In the analysis process, all papers were coded on four aspects directly reflecting the research question: the purpose of the gathered users’ design feedback (RQ1), the qualitative characteristics of the evaluation output (RQ2), assessments of validity and thoroughness (RQ3), and assessments of downstream impact (RQ4). Furthermore, all papers were coded according to UEM type, evaluation output types, comparison criterion (the criteria used, if any, to assess the design feedback), the involved users or participants, and research design.
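Purely as an illustration of the coding scheme just described (the field names below simply mirror the listed aspects and are not taken from the paper), a per-paper record might look like this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaperCoding:
    """Hypothetical record mirroring the coding aspects listed above."""
    purpose: str                           # RQ1: purpose of gathering users' design feedback
    qualitative_characteristics: str       # RQ2: characteristics of the evaluation output
    validity_thoroughness: Optional[str]   # RQ3: assessment, if reported
    downstream_impact: Optional[str]       # RQ4: assessment, if reported
    uem_type: str                          # inspection, usability testing, or inquiry
    output_types: str                      # e.g. usability problems, design suggestions
    comparison_criterion: Optional[str]    # criterion used to assess the feedback, if any
    participants: str                      # involved users or participants
    research_design: str
```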

The papers included for analysis concerned users’ design feedback gathered through a wide range of methods from all the main UEM groups. The papers presented studies where users’ design feedback was gathered through usability inspections, usability testing, and inquiry methods. Among the usability testing studies, users’ design feedback was gathered both in extended debriefs and through users’ self-reporting of problems or incidents. The inquiry methods were used both for stand-alone usability evaluations and as part of field tests (see Table 2). This breadth of studies should provide a good basis for making general claims on the benefits and limitations of users’ design feedback.

Of the analysed studies, 19 provided detailed empirical data supporting their conclusions. The remaining studies presented the findings only summarily. The studies which provided detailed empirical data ranged from problem-counting head-to-head UEM comparisons (e.g. [ 3 , 17 , 27 , 44 ]) to in-depth reports on lessons learnt concerning a particular UEM (e.g. [ 30 , 45 ]). All but two of the studies with detailed presentations of empirical data [ 20 , 30 ] compared evaluation output from users’ design feedback to output from interaction data and/or data from inspections with usability experts.

In the presented studies, users’ design feedback was typically treated as a source of usability problems or incidents, despite the fact that users’ design feedback may also serve as a gateway to other types of evaluation output, such as experiential issues, reflections on the system’s context of use, and design suggestions. The findings from this review therefore mainly concern usability problems or incidents.

The purpose of gathering users’ design feedback (RQ1)

In the reviewed studies, different data collection methods for users’ design feedback were often pitted against each other. For example, Bruun et al. [ 44 ] compared online report forms, online discussion forum, and diary as methods to gather users’ self-reports of problems or incidents. Henderson et al. [ 3 ] compared interviews and questionnaires as means of gathering details on usability problems as part of usability testing debriefs. Cowley and Radford-Davenport [ 20 ] compared online discussion forum and focus groups for purposes of stand-alone usability evaluations.

These comparative studies surely provide relevant insight into the differences between specific data collection methods for users’ design feedback. However, though comparative, most of these studies mainly addressed one specific purpose for gathering users’ design feedback. Bruun et al. only considered users’ design feedback in the context of users’ self-reporting of problems in usability tests. Henderson et al. [ 3 ] only considered users’ self-reporting during usability testing debriefs. Cowley and Radford-Davenport [ 20 ] only considered methods for users’ design feedback as stand-alone evaluation methods. We therefore see it as beneficial to contrast the different purposes for gathering users’ design feedback in the context of usability evaluations.

Four specific purposes for gathering users’ design feedback were identified: (a) a budget approach to problem identification in usability testing, (b) to expand on interaction data from usability testing, (c) to identify problems in the users’ everyday context, and (d) to benefit from users’ knowledge or creativity.

The budget approach

In some of the studies, users’ design feedback was used as a budget approach to reach findings that one could also have reached through classical usability testing. This is, in particular, seen in the five studies of usability testing with self-reports where the users’ design feedback consisted mainly of reports of problems or incidents [ 27 , 44 , 46 – 48 ]. Here, the users were to run the usability test and report on the usability problems independently of the test administrator, potentially saving evaluation costs. For example, in their study of usability testing with disabled users, Petrie et al. [ 48 ] compared the self-reported usability problems from users who self-administered the usability test at home to those who participated in a similar usability test in the usability laboratory. Likewise, Andreasen et al. [ 27 ] and Bruun et al. [ 44 ] compared different approaches to remote asynchronous usability testing. In these studies of self-reported usability problems, users’ design feedback hardly generated findings that complemented other data sources. Rather, the users’ design feedback mainly generated a subset of the usability problems already identified through interaction data.

Expanding on interaction data

Other reviewed studies concerned how users’ design feedback may expand on usability test interaction data. This was seen in some of the studies where users’ design feedback is gathered as part of the usability testing procedure or debrief session [ 4 , 14 , 19 , 49 , 59 ]. Here, users’ design feedback generated additional findings rather than merely reproducing the findings of the usability test interaction data. For example, O’Donnel et al. [ 4 ] showed how the participants of a usability test converged on new suggestions for redesign in focus group sessions following the usability test. Similarly, Følstad and Hornbæk [ 19 ] found that participants in cooperative usability testing, when walking through the completed tasks of a usability test, identified other types of usability issues than those already evident from the interaction data. In both these studies, the debrief was set up so as to aid the memory of the users by the use of video recordings from the test session [ 4 ] or by walkthroughs of the test tasks [ 19 ]. Other studies were less successful in generating additional findings through such debrief sessions. For example, Henderson et al. [ 3 ] found that users during debrief interviews, though readily reporting problems, were prone to issues concerning recall, recognition, overload, and prominence. Likewise, Donker and Markopoulos [ 51 ], in their debrief interviews with children, found them susceptible to forgetfulness. Neither of these studies included specific memory aids during the debrief session.

Problem reports from the everyday context

Users’ design feedback may also serve to provide insight that is impractical to gather by other data sources. This is exemplified in the four studies concerning users’ design feedback gathered through inquiry methods as part of field tests [ 17 , 28 , 45 , 52 ]. Here, users reported on usability problems as they appear in everyday use of the interactive system, rather than usability problems encountered during the limited tasks of a usability test. As such, this form of users’ design feedback provides insight into usability problems presumably holding high face validity, and that may be difficult to identify during usability testing. For example, Christensen and Frøkjær [ 45 ] gathered user reports on problems with a fleet management system through integrated reporting software. Likewise, Horsky et al. [ 52 ] gathered user reports on problems with a medical application through emails from medical personnel. The user reports in these studies, hence, provided insight into problems as they appeared in the workday of the fleet managers and medical personnel, respectively.

Benefitting from users’ knowledge and creativity

Finally, in some of the studies, users’ design feedback was gathered with the aim of benefiting from the particular knowledge or creativity of users. This is, in particular, seen in studies where users were involved as usability inspectors [ 25 , 43 , 53 , 54 ] and in studies where inquiry methods were applied for stand-alone usability evaluations [ 20 , 28 , 30 , 55 , 56 ]. Also, some of the studies where users’ design feedback was gathered through extended debriefing sessions had such a purpose [ 3 , 4 , 19 , 57 ]. For example, in their studies of users as usability inspectors, Barcelos et al. [ 53 ], Edwards et al. [ 54 ], and Følstad [ 25 ] found the user inspectors to attend to other aspects of the interactive systems than did the usability expert inspectors. Cowley and Radford-Davenport [ 20 ], as well as Ebenezer [ 58 ], in their studies of focus groups and discussion forums for usability evaluation, found participants to eagerly provide design suggestions, as did Sylaiou et al. [ 64 ] in their study of evaluations based on interviews and questionnaires with open-ended questions. Similarly, O’Donnel et al. [ 4 ] found users in focus groups arranged as follow-ups to classical usability testing sessions to identify and develop design suggestions; in particular in response to tasks that were perceived by the users as difficult.

How do the qualitative characteristics of users’ design feedback compare to that of other evaluation data? (RQ2)

Given that users’ design feedback is gathered with the purpose of expanding on the interaction data from usability testing, or with the aim of benefitting from users’ knowledge and creativity, it is relevant to know whether users’ design feedback actually generates findings that are different from what one could have reached through other data sources. Such knowledge may be found in the studies that addressed the qualitative characteristics of the usability issues identified on the basis of users’ design feedback.

The qualitative characteristics of the identified usability issues were detailed in nine of the reviewed papers [ 17 , 19 , 20 , 25 , 28 , 52 – 54 , 59 ]. These studies indeed suggest that evaluations based on users’ design feedback may generate output that is qualitatively different from that of evaluations based on other types of data. A striking finding across these papers is the degree to which users’ design feedback may facilitate the identification of usability issues specific to the particular domain of the interactive system. In six of the papers addressing the qualitative characteristics of the evaluation output [ 19 , 25 , 28 , 52 – 54 ], the findings based on users’ design feedback concerned domain-specific issues not captured by the alternative UEMs. For example, in a heuristic evaluation of virtual world applications, studied by Barcelos et al. [ 53 ], online gamers who were representative of the typical users of the applications identified relatively more issues related to the concept of playability than did usability experts. Emergency response personnel and mobile salesforce representatives involved in cooperative usability testing, studied by Følstad and Hornbæk [ 19 ], identified more issues concerning needed functionality and organisational requirements when providing design feedback in the interpretation phases of the testing procedure than when providing interaction data in the interaction phases. The users of a public sector work support system, studied by Hertzum [ 28 ], identified more utility problems in a workshop test, where the users were free to provide design feedback, than in a classical usability test. Hertzum suggested that the rigidly set tasks, observational setup, and formal setting of the usability test made this evaluation “biased toward usability at the expense of utility”, whereas the workshop allowed more free exploration on the basis of the participants’ work knowledge, which was beneficial for the identification of utility problems and bugs.

In two of the studies, however, the UEMs involving users’ design feedback were not reported to generate more domain-specific issues than did the other UEMs [ 17 , 59 ]. These two studies differed from the others on one important point: the evaluated systems were general purpose work support systems (one spreadsheet system and one system for electronic Post-It notes), not systems for specialized work support. A key motivation for gathering users’ design feedback is that users possess knowledge not held by other parties of the development process. Consequently, as the contexts of use for these two systems most likely were well known to the involved development teams, the value of tapping into users’ domain knowledge may have been lower than for the evaluations of more specialized work support systems.

The studies concerning the qualitative characteristics of users’ design feedback also suggested the importance of not relying solely on such feedback. In all of the seven studies in which findings from UEMs based on users’ design feedback were compared with findings from UEMs based on other data sources (interaction data or usability experts’ findings), the other data sources generated usability issues that were not identified from the users’ design feedback. For example, the usability experts in usability inspections studied by Barcelos et al. [ 53 ] and Følstad [ 25 ] identified a number of usability issues not identified by the users; issues that also had different qualitative characteristics. In the study by Barcelos et al. [ 53 ], the usability expert inspectors identified more issues pertaining to system configuration than did the user inspectors. In the study by Følstad [ 25 ], the usability expert inspectors identified more domain-independent issues. Hence, depending only on users’ design feedback would have limited the findings with respect to issues related to what Barcelos et al. [ 53 ] referred to as “the classical usability concept” (p. 303).

These findings are in line with our assumption that users’ design feedback may complement other types of evaluation data by supporting qualitatively different evaluation output, but not replace other evaluation data. Users’ design feedback may constitute an important addition to other evaluation data sources by supporting the identification of domain-specific usability issues as well as user-based suggestions for redesign.

Which levels of validity and thoroughness are to be expected for users’ design feedback? (RQ3)

To rely on users’ design feedback as data in usability evaluations, we need to trust the data. To be used for any evaluation purpose, the findings based on users’ design feedback need to hold adequate levels of validity ; that is, the usability problems identified during the evaluation should reflect problems that the user can be expected to encounter when using the interactive system outside the evaluation context. Furthermore, if users’ design feedback is to be used as the only data in usability evaluations, it is necessary to know the levels of thoroughness that can be expected; that is, the degree to which the evaluation serves to identify all relevant usability problems that the user can be expected to encounter.

Following Hartson et al. [ 35 ], validity and thoroughness scores can be calculated on the basis of (a) the set of usability problems predicted with a particular UEM and (b) the set of real usability problems, that is, usability problems actually encountered by users outside the evaluation context. The challenge of such calculations, however, is that we need to establish a reasonably complete set of real usability problems. This challenge has typically been resolved by using the findings from classical usability testing as an approximation to such a set [ 65 ], though this approach introduces the risk of erroneously classifying usability problems as false alarms [ 6 ].
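Stated concretely, and using classical usability testing as the approximation of the set of real problems as described above, the Hartson et al. [ 35 ] measures can be written as follows, where P denotes the set of problems predicted with the assessed UEM and R the set of real usability problems:

```latex
\mathrm{validity} = \frac{|P \cap R|}{|P|}, \qquad
\mathrm{thoroughness} = \frac{|P \cap R|}{|R|}
```

For illustration only: if users predict 20 problems, of which 12 are also among 30 problems identified in classical usability testing, validity is 12/20 = 60% and thoroughness is 12/30 = 40%.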

A substantial proportion of the reviewed papers present general views on the validity of the users’ design feedback. However, only five of the papers included in the review provide sufficient detail to calculate validity scores. This, provided that we assume that classical laboratory testing can serve as an approximation to the complete set of real usability problems. In three of these [ 44 , 46 , 47 ], the users’ design feedback was gathered as self-reports during remote usability testing, in one [ 3 ] users’ design feedback was gathered during usability testing debrief, and in one [ 43 ] users’ design feedback was gathered through usability inspection. The validity scores ranged between 60% [ 43 ] and 89% [ 47 ], meaning that in all of the studies 60% or more of the usability problems or incidents predicted by the users were also confirmed by classical usability testing.

The reported validity values for users’ design feedback were arguably acceptable. For comparison, in newer empirical studies of heuristic evaluation with usability experts the validity of the evaluation output has typically been found to be well below 50% (e.g. [ 6 , 7 ]). Furthermore, following from the challenge of establishing a complete set of real usability problems, it may be assumed that several of the usability problems not identified in classical usability testing may nevertheless represent real usability problems [ 43 , 47 ].

Thoroughness concerns the proportion of predicted real problems relative to the full set of real problems [ 35 ]. Some of the above studies also provided empirical data that can be used to assess the thoroughness of users’ design feedback. In the Hartson and Castillo [ 47 ] study, 68% of the critical incidents observed during video analysis were also self-reported by the users. The corresponding proportion for the study by Henderson et al. [ 3 ] on problem identification from interviews was 53%. For the study on users as usability inspectors by Følstad et al. [ 43 ], the median thoroughness score for individual inspectors was 25%; however, for nominal groups of seven inspectors, thoroughness rose to 70%. Larger numbers of evaluators or users are beneficial to thoroughness [ 35 ]. This is, in particular, seen in the study by Bruun et al. [ 44 ] where 43 users self-reporting usability problems in remote usability evaluations were able to identify 78% of the problems identified in classical usability testing. For comparison, in newer empirical studies of heuristic evaluation with usability experts thoroughness is typically well above 50% (e.g. [ 6 , 7 ]).
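As a rough way to reason about the effect of adding evaluators (this model is not used in the reviewed studies, and it assumes that each user detects each real problem independently with probability equal to the individual thoroughness t), the expected thoroughness of an aggregated group of n users is

```latex
\mathrm{thoroughness}_{n} \;\approx\; 1 - (1 - t)^{n}
```

With t = 0.25 and n = 7 this gives about 1 − 0.75⁷ ≈ 0.87, above the 70% reported for nominal groups of seven in [ 43 ]; since users tend to report overlapping problems, such independence-based estimates should be read as optimistic upper bounds.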

The empirical data on thoroughness seem to support the conclusion that users typically underreport problems in their design feedback, though the extent of such underreporting varies widely between evaluations. In particular, involving larger numbers of users may mitigate this deficit in users’ design feedback as an evaluation data source.

Which levels of downstream impact are to be expected for users’ design feedback? (RQ4)

Seven of the papers presented conclusions concerning the impact of users’ design feedback on the subsequent design process; that is, whether the issues identified during evaluations led to changes in later versions of the system. Rector et al. [ 60 ], Obrist et al. [ 56 ], and Wright and Monk [ 14 ] concluded that the direct access to users’ reports served to strengthen the design team’s understanding of the users’ needs. The remaining four studies concerning downstream impact provided more detailed evidence on this.

In a study by Hertzum [ 28 ], the impact ratio for a workshop test was found to be more than 70%, which was similar to that of a preceding usability test in the same development process. Hertzum argued that a key factor determining the impact of an evaluation is its location in time: evaluations early in the development process are argued to have more impact than late evaluations. Følstad and Hornbæk [ 19 ], in their study of cooperative usability testing, found the usability issues identified on the basis of users’ design feedback during interpretation phases to have equal impact to those identified on the basis of interaction data. Følstad [ 25 ], in his study of users and usability experts as inspectors for applications in three specialized domains, found usability issues identified by users on average to have higher impact than those of usability experts. Horsky et al. [ 52 ] studied usability evaluations of a medical work support system by way of users’ design feedback through email and free-text questionnaires during a field trial, and compared the findings from these methods to findings from classical usability testing and inspections conducted by usability experts. Here, 64% of the subsequent changes to the system were motivated by issues reported in users’ self-reports by email. E-mail reports were also the most prominent source of users’ design feedback; 85 of a total of 155 user comments were gathered through such reports. Horsky et al. suggested that the types of problems identified from the e-mail reports were an important reason for the high impact of the findings from this method.

Discussion and conclusion

The benefits and limitations of users’ design feedback

The literature review has provided an overview concerning the potential benefits and limitations of users’ design feedback. We found that users’ design feedback can be gathered for four purposes. When users’ design feedback is gathered to expand on interaction data from usability testing, as in usability testing debriefs (e.g. [ 4 ]), or to benefit from the users’ knowledge or creativity, as in usability inspections with user inspectors (e.g. [ 53 ]), it is critical that the evaluation output includes findings that complement what could be achieved through other evaluation data sources; if not, the rationale for gathering users’ design feedback in such studies is severely weakened. When users’ design feedback is gathered as a budget approach to classical usability testing, as in asynchronous remote usability testing (e.g. [ 44 ]), or as a way to identify problems in the users’ everyday context, as in inquiry methods as part of field tests (e.g. [ 45 ]), it is critical that the evaluation output holds adequate validity and thoroughness.

The studies included in the review indicate that users’ design feedback may indeed complement other types of evaluation data. This is seen in the different qualitative characteristics of findings made on the basis of users’ design feedback compared to those made from other evaluation data types. This finding is important, as it may motivate usability professionals to make better use of UEMs particularly designed to gather users’ design feedback to complement other evaluation data. Such UEMs may include the pluralistic walkthrough, where users participate as inspectors in groups with usability experts and development team representatives, and the cooperative usability testing, where users’ design feedback is gathered through dedicated interpretation phases added to the classical usability testing procedure. Using UEMs that support users’ design feedback seems to be particularly important when evaluating systems for specialized domains, such as those of medical personnel or public sector employees. Possibly, the added value of users’ design feedback as a complementary data source may be reduced in evaluations of interactive systems for the general public; here, the users’ design feedback may not add much to what is already identified through interaction data or usability experts’ findings.

Furthermore, the reviewed studies indicated that users can self-report incidents or problems validly. For usability testing with self-reporting of problems, validity values for self-reports were consistently 60% or above; most incidents or problems identified through self-report were also observed during interaction. In the studies providing validity findings, the objects of evaluation were general purpose work support systems or general public websites, potentially explaining why the users did not make findings more complementary to those of the classical usability test.

Users were, however, found to be less able with regard to thoroughness. In the reviewed studies, thoroughness scores varied from 25 to 78%. A relatively larger number of users seems to be required to reach adequate thoroughness through users’ design feedback than through interaction data. Evaluations depending solely on users’ design feedback may need to increase the number of users relative to what would be used, for example, in classical usability testing.

Finally, issues identified from users’ design feedback may have substantial impact in the subsequent development process. The relative impact of users’ design feedback compared to that of other data sources may of course differ between studies and development processes, e.g. due to contextual variation. Nevertheless, the reviewed studies indicate users’ design feedback to be at least as impactful as evaluation output from other data sources. This finding is highly relevant for usability professionals, who typically aim to get the highest possible impact on development. One reason why findings from users’ design feedback were found to have relatively high levels of impact may be that such findings, as opposed to, for example, the findings of usability experts in usability inspections, allow the development team to access the scarce resource of users’ domain knowledge. Hence, the persuasive character of users’ design feedback may be understood as a consequence of it being qualitatively distinct from evaluation output from other data sources, rather than merely being a consequence of this feedback coming straight from the users.

Implications for usability evaluation practice

The findings from the review may be used to advise usability evaluation practice. In the following, we summarize what we find to be the most important take-aways for practitioners:

Users’ design feedback may be particularly beneficial when conducting evaluation of interactive systems for specialized contexts of use. Here, users’ design feedback may generate findings that complement those based on other types of evaluation data. However, for this benefit to be realized, the users’ design feedback should be gathered with a clear purpose of benefitting from the knowledge and creativity of users.

When users’ design feedback is gathered through extended debriefs, users are prone to forgetting encountered issues or incidents. Consider supporting the users’ recall by the use of, for example, video recordings from system interaction or by walking through the tasks.

Users’ design feedback may support problem identification in evaluations where the purpose is a budget approach to usability testing or problem reporting from the field. However, due to challenges in thoroughness, it may be necessary to scale up such evaluations to involve more users than would be needed, e.g., for classical usability testing.

Evaluation output based on users’ design feedback seems to be impactful in the downstream development process. Hence, gathering users’ design feedback may be an effective way to boost the impact of usability evaluation.

Limitations and future work

Being a literature review, this study is limited by the research papers available. Though evaluation findings from interaction data and inspections with usability experts have been thoroughly studied in the research literature, the literature on users’ design feedback is limited. Furthermore, as users’ design feedback is not used as a term in the current literature, the identification of relevant studies was challenging, to the point that we cannot be certain that no relevant study has passed unnoticed.

Nonetheless, the identified papers, though concerning a wide variety of UEMs, were found to provide reasonably consistent findings. Furthermore, the findings suggest that users’ design feedback is a promising area for further research on usability evaluation.

The review also serves to highlight possible future research directions, to optimize UEMs for users’ design feedback and to further investigate which types of development processes in particular benefit from users’ design feedback. In particular, the following topics may be highly relevant for future work:

More systematic studies of the qualitative characteristics of UEM output in general, and users’ design feedback in particular. In the review, a number of studies addressing various qualitative characteristics were identified. However, to optimize UEMs for users’ design feedback it may be beneficial to study the qualitative characteristics of evaluation output according to more comprehensive frameworks where feedback is characterized e.g. in terms of being general or domain-specific as well as being problem oriented, providing suggestions, or concerning the broader context of use.

Investigating users’ design feedback across types of application areas. The review findings suggest that the usefulness of users’ design feedback in part may be decided by application area. In particular, application domains characterized by high levels of specialization may benefit more from evaluations including users’ design feedback, as the knowledge represented by the users is not as easily available through other means as in more general domains. Future research is needed for more in-depth study of this implication of the findings.

Systematic studies of users’ design feedback across the development process. It is likely, as seen from the review, that the usefulness of users’ design feedback depends on the stage of the development process in which the evaluation is conducted. Furthermore, different stages of the development process may require different UEMs for gathering users’ design feedback. In the review, we identified four typical motivations for gathering users’ design feedback. These may serve as a starting point for further studies of users’ design feedback across the development process.

While the review provides an overview of our current and fragmented knowledge of users’ design feedback, important areas of research still remain. We conclude that users’ design feedback is a worthy topic of future UEM research, and hope that this review can serve as a starting point for this endeavour.

Footnote 1: The review is based on the author’s Ph.D. thesis on users’ design feedback, where it served to position three studies conducted by the author relative to other work done within this field. The review presented in this paper includes these three studies as they satisfy the inclusion criteria for the review. It may also be noted that, to include a broader set of perspectives on the benefits and limitations of users’ design feedback, the inclusion criteria applied in the review presented here are more relaxed compared to those of the Ph.D. thesis. The thesis was accepted at the University of Oslo in 2014.

References

Rubin J, Chisnell D (2008) Handbook of usability testing: how to plan, design, and conduct effective tests, 2nd edn. Wiley, Indianapolis

Bias RG (1994) The pluralistic usability walkthrough: coordinated empathies. In: Nielsen J, Mack RL (eds) Usability inspection methods. Wiley, New York, pp 63–76

Henderson R, Podd J, Smith MC, Varela-Alvarez H (1995) An examination of four user-based software evaluation methods. Interact Comput 7(4):412–432

O’Donnel PJ, Scobie G, Baxter I (1991) The use of focus groups as an evaluation technique in HCI. In: Diaper D, Hammond H (eds) People and computers VI, proceedings of HCI 1991. Cambridge University Press, Cambridge, pp 212–224

Lewis JR (2006) Sample sizes for usability tests: mostly math, not magic. Interactions 13(6):29–33

Chattratichart J, Brodie J (2004) Applying user testing data to UEM performance metrics. In: Dykstra-Erickson E, Tscheligi M (eds) CHI’04 extended abstracts on human factors in computing systems. ACM, New York, pp 1119–1122

Hvannberg ET, Law EL-C, Lárusdóttir MK (2007) Heuristic evaluation: comparing ways of finding and reporting usability problems. Interact Comput 19(2):225–240

Nielsen J (2001) First rule of usability? don’t listen to users. Jakob Nielsen’s Alertbox: August 5, 2001. http://www.nngroup.com/articles/first-rule-of-usability-dont-listen-to-users/

Whitefield A, Wilson F, Dowell J (1991) A framework for human factors evaluation. Behav Inf Technol 10(1):65–79

Gould JD, Lewis C (1985) Designing for usability: key principles and what designers think. Commun ACM 28(3):300–311

Wilson GM, Sasse MA (2000) Do users always know what’s good for them? Utilising physiological responses to assess media quality. People and computers XIV—usability or else!. Springer, London, pp 327–339.

Åborg C, Sandblad B, Gulliksen J, Lif M (2003) Integrating work environment considerations into usability evaluation methods—the ADA approach. Interact Comput 15(3):453–471

Muller MJ, Matheson L, Page C, Gallup R (1998) Methods & tools: participatory heuristic evaluation. Interactions 5(5):13–18

Wright PC, Monk AF (1991) A cost-effective evaluation method for use by designers. Int J Man Mach Stud 35(6):891–912

Dumas JS, Redish JC (1999) A practical guide to usability testing. Intellect Books, Exeter

Følstad A, Law E, Hornbæk K (2012) Analysis in practical usability evaluation: a survey study. In: Chi E, Höök K (eds) Proceedings of the SIGCHI conference on human factors in computing systems, CHI '12. ACM, New York, pp 2127–2136

Smilowitz ED, Darnell MJ, Benson AE (1994) Are we overlooking some usability testing methods? A comparison of lab, beta, and forum tests. Behav Inf Technol 13(1–2):183–190

Vermeeren AP, Law ELC, Roto V, Obrist M, Hoonhout J, Väänänen-Vainio-Mattila K (2010) User experience evaluation methods: current state and development needs. In: Proceedings of the 6th Nordic conference on human-computer interaction: extending boundaries, ACM, New York, p 521–530

Følstad A, Hornbæk K (2010) Work-domain knowledge in usability evaluation: experiences with cooperative usability testing. J Syst Softw 83(11):2019–2030

Cowley JA, Radford-Davenport J (2011) Qualitative data differences between a focus group and online forum hosting a usability design review: a case study. Proceedings of the human factors and ergonomics society annual meeting 55(1): 1356–1360

Jacobsen NE (1999) Usability evaluation methods: the reliability and usage of cognitive walkthrough and usability test. (Doctoral thesis. University of Copenhagen, Denmark)

Cockton G, Lavery D, Woolrych A (2008) Inspection-based evaluations. In: Sears A, Jacko J (eds) The human-computer interaction handbook: fundamentals, evolving technologies and emerging applications, 2nd edn. Lawrence Erlbaum Associates, New York, pp 1171–1190

Mack RL, Nielsen J (1994) Executive summary. In: Nielsen J, Mack RL (eds) Usability inspection methods. Wiley, New York, pp 1–23

Baauw E, Bekker MM, Barendregt W (2005) A structured expert evaluation method for the evaluation of children’s computer games. In: Costabile MF, Paternò F (Eds.) Proceedings of human-computer interaction—INTERACT 2005, lecture notes in computer science 3585, Springer, Berlin, p 457–469

Følstad A (2007) Work-domain experts as evaluators: usability inspection of domain-specific work support systems. Int J Human Comp Interact 22(3):217–245

Frøkjær E, Hornbæk K (2005) Cooperative usability testing: complementing usability tests with user-supported interpretation sessions. In: van der Veer G, Gale C (eds) CHI’05 extended abstracts on human factors in computing systems. ACM Press, New York, pp 1383–1386

Andreasen MS, Nielsen HV, Schrøder SO, Stage J (2007) What happened to remote usability testing? An empirical study of three methods. In: Rosson MB, Gilmore D (Eds.) CHI’07: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, New York, p 1405–1414

Hertzum M (1999) User testing in industry: a case study of laboratory, workshop, and field tests. In: Kobsa A, Stephanidis C (Eds.) Proceedings of the 5th ERCIM Workshop on User Interfaces for All, Dagstuhl, Germany, November 28–December 1, 1999. http://www.interaction-design.org/references/conferences/proceedings_of_the_5th_ercim_workshop_on_user_interfaces_for_all.html

Rosenbaum S, Kantner L (2007) Field usability testing: method, not compromise. Proceedings of the IEEE international professional communication conference, IPCC 2007. doi: 10.1109/IPCC.2007.4464060

Choe P, Kim C, Lehto MR, Lehto X, Allebach J (2006) Evaluating and improving a self-help technical support web site: use of focus group interviews. Int J Human Comput Interact 21(3):333–354

Greenbaum J, Kyng M (eds) (1991) Design at work. Lawrence Erlbaum Associates, Hillsdale

Desurvire HW, Kondziela JM, Atwood ME (1992) What is gained and lost when using evaluation methods other than empirical testing. In: Monk A, Diaper D, Harrison MD (eds) People and computers VII: proceedings of HCI 92. Cambridge University Press, Cambridge, pp 89–102

Karat CM, Campbell R, Fiegel T (1992) Comparison of empirical testing and walkthrough methods in user interface evaluation. In: Bauersfeld P, Bennett J, Lynch G (Eds.) CHI’92: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, New York, p 397–404

Gray WD, Salzman MC (1998) Damaged merchandise? A review of experiments that compare usability evaluation methods. Human Comput Interact 13(3):203–261

Hartson HR, Andre TS, Williges RC (2003) Criteria for evaluating usability evaluation methods. Int J Human Comput Interact 15(1):145–181

Law EL-C (2006) Evaluating the downstream utility of user tests and examining the developer effect: a case study. Int J Human Comput Interact 21(2):147–172

Uldall-Espersen T, Frøkjær E, Hornbæk K (2008) Tracing impact in a usability improvement process. Interact Comput 20(1):48–63

Frøkjær E, Hornbæk K (2008) Metaphors of human thinking for usability inspection and design. ACM Trans Comput Human Interact (TOCHI) 14(4):20:1–20:33

Fu L, Salvendy G, Turley L (2002) Effectiveness of user testing and heuristic evaluation as a function of performance classification. Behav Inf Technol 21(2):137–143

Kitchenham B (2004) Procedures for performing systematic reviews (Technical Report TR/SE-0401). Keele, UK: Keele University. http://www.scm.keele.ac.uk/ease/sreview.doc

Harzing AW (2013) A preliminary test of Google Scholar as a source for citation data: a longitudinal study of Nobel prize winners. Scientometrics 94(3):1057–1075

Meho LI, Yang K (2007) Impact of data sources on citation counts and rankings of LIS faculty: web of Science versus Scopus and Google Scholar. J Am Soc Inform Sci Technol 58(13):2105–2125

Følstad A, Anda BC, Sjøberg DIK (2010) The usability inspection performance of work-domain experts: an empirical study. Interact Comput 22:75–87

Bruun A, Gull P, Hofmeister L, Stage J (2009) Let your users do the testing: a comparison of three remote asynchronous usability testing methods. In: Hickley K, Morris MR, Hudson S, Greenberg S (Eds.) CHI’09: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, New York, p 1619–1628

Christensen L, Frøkjær E (2010) Distributed usability evaluation: enabling large-scale usability evaluation with user-controlled Instrumentation. In: Blandford A, Gulliksen J (Eds.) NordiCHI’10: Proceedings of the 6th Nordic conference on human-computer interaction: extending boundaries, ACM, New York, p 118–127

Bruun A, Stage J (2012) The effect of task assignments and instruction types on remote asynchronous usability testing. In: Chi EH, Höök K (Eds.) CHI’12: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, New York, p 2117–2126

Hartson H R, Castillo JC (1998) Remote evaluation for post-deployment usability improvement. In: Catarci T, Costabile MF, Santucci G, Tarafino L, Levialdi S (Eds.) AVI98: Proceedings of the working conference on advanced visual interfaces, ACM Press, New York, p 22–29

Petrie H, Hamilton F, King N, Pavan P (2006) Remote usability evaluations with disabled people. In: Grinter R, Rodden T, Aoki P, Cutrell E, Jeffries R, Olson G (Eds.) CHI’06: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, New York, p 1133–1141

Cunliffe D, Kritou E, Tudhope D (2001) Usability evaluation for museum web sites. Mus Manag Curatorship 19(3):229–252

Sullivan P (1991) Multiple methods and the usability of interface prototypes: the complementarity of laboratory observation and focus groups. In: Proceedings of the International Conference on Systems Documentation—SIGDOC’91, ACM, New York, p 106–112

Donker A, Markopoulos P (2002) A comparison of think-aloud, questionnaires and interviews for testing usability with children. In: Faulkner X, Finlay J, Détienne F (eds) People and computers XVI—memorable yet invisible, proceedings of HCI 2002. Springer, London, pp 305–316

Horsky J, McColgan K, Pang JE, Melnikas AJ, Linder JA, Schnipper JL, Middleton B (2010) Complementary methods of system usability evaluation: surveys and observations during software design and development cycles. J Biomed Inform 43(5):782–790

Barcelos TS, Muñoz R, Chalegre V (2012) Gamers as usability evaluators: A study in the domain of virtual worlds. In: Anacleto JC, de Almeida Nedis VP (Eds.) IHC’12: Proceedings of the 11th brazilian symposium on human factors in computing systems, Brazilian Computer Society, Porto Alegre, p 301–304

Edwards PJ, Moloney KP, Jacko JA, Sainfort F (2008) Evaluating usability of a commercial electronic health record: a case study. Int J Hum Comput Stud 66:718–728

Kontio J, Lehtola L, Bragge J (2004) Using the focus group method in software engineering: obtaining practitioner and user experiences. In: Proceedings of the International Symposium on Empirical Software Engineering – ISESE, IEEE, Washington, p 271–280

Obrist M, Moser C, Alliez D, Tscheligi M (2011) In-situ evaluation of users’ first impressions on a unified electronic program guide concept. Entertain Comput 2:191–202

Marsh SL, Dykes J, Attilakou F (2006) Evaluating a geovisualization prototype with two approaches: remote instructional vs. face-to-face exploratory. In: Proceedings of information visualization 2006, IEEE, Washington, p 310–315

Ebenezer C (2003) Usability evaluation of an NHS library website. Health Inf Libr J 20(3):134–142

Yeo A (2001) Global-software development lifecycle: an exploratory study. In: Jacko J, Sears A (Eds.) CHI’01: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, New York, p 104–111

Rector AL, Horan B, Fitter M, Kay S, Newton PD, Nowlan WA, Robinson D, Wilson A (1992) User centered development of a general practice medical workstation: The PEN&PAD experience. In: Bauersfeld P, Bennett J, Lynch G (Eds.) CHI ‘92: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, New York, p 447–453

Smith A, Dunckley L (2002) Prototype evaluation and redesign: structuring the design space through contextual techniques. Interact Comput 14(6):821–843

Ross S, Ramage M, Ramage Y (1995) PETRA: participatory evaluation through redesign and analysis. Interact Comput 7(4):335–360

Lamanauskas L, Pribeanu C, Vilkonis R, Balog A, Iordache DD, Klangauskas A (2007) Evaluating the educational value and usability of an augmented reality platform for school environments: some preliminary results. In: Proceedings of the 4th WSEAS/IASME international conference on engineering education p 86–91

Sylaiou S, Economou M, Karoulis A, White M (2008) The evaluation of ARCO: a lesson in curatorial competence and intuition with new technology. ACM Comput Entertain 6(20):23

Hornbæk K (2010) Dogmas in the assessment of usability evaluation methods. Behav Inf Technol 29(1):97–111


Acknowledgements

The presented work was supported by the Research Council of Norway, Grant Numbers 176828 and 203432. Thanks to Professor Kasper Hornbæk for providing helpful and constructive input on the manuscript and for supervising the Ph.D. work on which it is based.

Competing interests

The author declares no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations

SINTEF, Forskningsveien 1, 0373, Oslo, Norway

Asbjørn Følstad


Corresponding author

Correspondence to Asbjørn Følstad .

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article

Følstad, A. Users’ design feedback in usability evaluation: a literature review. Hum. Cent. Comput. Inf. Sci. 7 , 19 (2017). https://doi.org/10.1186/s13673-017-0100-y

Download citation

Received : 02 July 2016

Accepted : 18 May 2017

Published : 03 July 2017

DOI : https://doi.org/10.1186/s13673-017-0100-y


Keywords

  • Usability evaluation
  • User reports
  • Literature review


A Systematic Literature Review of Usability Evaluation Guidelines on Mobile Educational Games for Primary School Students

  • Conference paper


  • Xiao Wen Lin Gao,
  • Braulio Murillo &
  • Freddy Paz

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11586))

Included in the following conference series:

  • International Conference on Human-Computer Interaction

3749 Accesses

1 Citation

  • The original version of this chapter was revised: Second author’s family name has been corrected. The correction to this chapter is available at https://doi.org/10.1007/978-3-030-23535-2_46

Mobile educational games have recently become popular with primary school students: they let children learn in an entertaining way, and because children now spend much of their time on mobile devices, especially smartphones, such games are widely used. Games of this kind must address several aspects in order to fulfil their purpose, such as usability, playability, learnability, effectiveness, and simplicity, which is why usability evaluation plays an important role. However, although many usability evaluation methods exist, most of them focus on traditional computer use and are not fully compatible with mobile phone use. A systematic literature review was therefore conducted to identify usability evaluation guidelines for mobile educational games aimed at primary school students. This work is a first step toward a set of usability guidelines for the evaluation of mobile educational games for primary school students.



Keywords

  • Systematic literature review
  • Usability guidelines
  • Usability evaluation
  • Mobile educational games
  • Primary school students

1 Introduction

Usability refers to the extent to which a product can be used with efficiency, effectiveness and satisfaction in a specific context of use, where the user achieves certain goals through the use of the product [ 1 ]. Usability is particularly important in educational applications: a highly usable application effectively supports learning and has a positive impact by motivating students to learn, and can therefore determine the success or failure of these applications [ 2 ].

The challenge of educational games is that people subconsciously assume that anything related to education will not be entertaining, even when it is paired with a word that by itself sounds fun, such as “video game”, and this discourages people from using them [ 3 ]. One crucial factor is that characteristics related to playability are often omitted from educational video games because they pursue a strong educational intention [ 4 ]. For this reason, usability plays a significant role in these games, helping developers improve their visual aspects and playability in order to achieve higher acceptance among players.

In this study, we present a systematic literature review to identify guidelines for the usability evaluation of mobile educational games for primary school students and the aspects that are considered relevant in the usability evaluation of mobile educational video games. Through this procedure, we establish the state of the art of usability evaluation guidelines for mobile educational games. The review is intended to serve as a literature base for future work aiming to create a new set of usability evaluation guidelines for mobile educational games.

The paper has the following structure. In Sect. 2, we describe the main concepts related to our topic. In Sect. 3, we present the methodology used to undertake this study. In Sect. 4, we present the results of our research. Finally, we present the conclusions in Sect. 5.

2 Background

2.1 Usability

According to the ISO/IEC 9126-1 standard [ 5 ], usability is defined as “the capability of the software product to be understood, learned, used and attractive to the user, when used under specified conditions.”

Another definition of usability is given by Nielsen [ 6 ], who defined it as “a quality attribute that assesses how easy user interfaces are to use.” This author mentioned that the word usability also refers to “methods for improving ease-of-use during the design process”, and defined it by 5 quality components:

Learnability: How easy is it for users to accomplish basic tasks the first time they encounter the design?

Efficiency: Once users have learned the design; how quickly can they perform tasks?

Memorability: When users return to the design after a period of not using it, how easily can they reestablish proficiency?

Errors: How many errors do users make, how severe are these errors, and how easily can they recover from the errors?

Satisfaction: How pleasant is it to use the design?

2.2 Usability Evaluation

An evaluation method is a procedure composed of a series of well-defined activities whose purpose is to collect user data related to the interaction of an end user with a software product and to understand how specific features of that software contribute to achieving a certain degree of usability [ 7 ].

Although there are several taxonomies to classify the usability evaluation methods, these can be classified broadly into two main groups: empirical methods and inspection methods [ 8 ].

The empirical methods are based on capturing and analyzing usage data from a group of representative users. While these users perform a series of predefined tasks, an evaluator, which may be a human or a piece of software, registers the results of their actions. Analysis of these collected results provides valuable information for detecting usability problems [ 7 ].

The inspection methods, on the other hand, are carried out by expert evaluators or designers and do not require the participation of real end users. These methods are based on examining the usability aspects of the user interface against a set of guidelines. Such guidelines can be used not only to review the level of compliance with certain usability attributes but also to predict interface problems, as in a heuristic evaluation [ 9 ].
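To make this distinction concrete, the short sketch below illustrates how findings from an inspection method are typically recorded as violations of named heuristics with severity ratings and then aggregated per heuristic. It is a minimal, hypothetical illustration: the heuristic names, the 0–4 severity scale, and the data are assumptions made for the example and are not drawn from the reviewed studies.

```python
from collections import defaultdict

# Hypothetical inspection findings: each entry is a usability problem
# reported by an expert evaluator, tagged with the violated heuristic
# and a severity rating (0 = not a problem ... 4 = usability catastrophe).
findings = [
    {"heuristic": "Visibility of system status", "severity": 3},
    {"heuristic": "Consistency and standards",   "severity": 2},
    {"heuristic": "Visibility of system status", "severity": 1},
    {"heuristic": "Error prevention",            "severity": 4},
]

# Aggregate the findings per heuristic to see where the interface
# departs most from the guidelines.
summary = defaultdict(lambda: {"count": 0, "max_severity": 0})
for f in findings:
    entry = summary[f["heuristic"]]
    entry["count"] += 1
    entry["max_severity"] = max(entry["max_severity"], f["severity"])

for heuristic, stats in summary.items():
    print(f"{heuristic}: {stats['count']} problem(s), worst severity {stats['max_severity']}")
```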

2.3 Game-Based Learning

According to Pho and Dinscore, game-based learning is a trend that has been implemented in many settings, including workplace training, education, and social media [ 10 ]. It turns users into designers of their own learning environment, using video games as the medium. Many studies report successful results from innovative educational practices mediated by video games; they also highlight a positive impact on the reasoning ability of children and on the development of complex capabilities, such as leadership and cooperation, when video games are used with primary school students [ 11 ].

3 Systematic Literature Review

A systematic literature review is a methodology that identifies, synthesizes, and interprets all available studies relevant to a previously formulated research question, topic area, or phenomenon of interest [ 12 ]. Although systematic reviews require more effort than traditional reviews, the advantages of undertaking this method are greater: such a review can identify gaps in current research and summarize the existing evidence in the literature in order to support further investigations. The aim of this work is to identify relevant studies about usability evaluation guidelines applied to mobile video games in the educational domain focused on primary school children, and to identify the aspects of educational games that are commonly considered as part of usability evaluation criteria. This work was based on the guidelines proposed by Kitchenham and Charters [ 13 ] for performing systematic literature reviews in the field of software engineering. The steps of this methodology are documented below.

3.1 Research Questions

The research questions formulated for this study are:

RQ1: What guidelines are used to measure the usability of educational video games for smartphones?

RQ2: What aspects of educational video games are considered in the usability evaluation?

In order to elaborate the search string, we defined general concepts using the PICOC method. The “comparison” criterion was not considered, because the focus of this research was not to compare “interventions”. The definition of each criterion is detailed in Table 1.

4 Search Process

Based on our research questions, we determined a set of terms and grouped them according to each PICOC criterion. In order to include only current studies relevant to the state of the art of usability guidelines for mobile educational games, we considered only studies published from 2014 onwards. The search string was defined as follows:

(“Educational game app” OR “educational video game” OR “mobile educational game” OR “educational game” OR “educational touchscreen application” OR “educational smartphone application” OR “educational smartphone game” OR “mobile games for learning” OR “game app for learning” OR “teaching with mobile games” OR “teaching with game apps”) AND (“children” OR “primary school student” OR “primary school”) AND (“Usability” OR “Interface” OR “User Interface” OR “UX” OR “User Experience”) AND (“Methodology” OR “Method” OR “Framework” OR “Guidelines” OR “Principles” OR “design” OR “evaluation” OR “user interface” OR “study”) AND (publication year  >  2013)

The search was performed using three recognized databases in order to obtain the relevant studies: Scopus, Springer, and IEEE Xplore. The search string was adapted according to the instructions of each search engine. No additional studies were considered.
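As an illustration of how such a boolean string can be assembled, the sketch below builds the query programmatically from term groups abbreviated from the string above. The variable names and the shortened term lists are assumptions for the example, and the exact query syntax accepted by each engine differs, so this is only a sketch of the general structure rather than the procedure actually followed by the authors.

```python
# Term groups abbreviated from the search string above; variable names
# are descriptive only, and the full lists in the paper are longer.
game_terms = ["educational game app", "educational video game", "mobile educational game"]
user_terms = ["children", "primary school student", "primary school"]
quality_terms = ["usability", "user interface", "user experience"]
approach_terms = ["guidelines", "method", "framework", "evaluation"]

def or_group(terms):
    """Quote each term and join the group with OR."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# Combine the groups with AND and restrict the publication year, mirroring
# the structure of the string above (engine-specific syntax will differ).
query = " AND ".join(or_group(g) for g in (game_terms, user_terms, quality_terms, approach_terms))
query += " AND (publication year > 2013)"
print(query)
```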

4.1 Inclusion and Exclusion Criteria

Every article obtained from the search was analyzed by its title, abstract, and keywords to determine whether it should be included in the review, that is, whether the proposal was focused on educational video games developed for smartphones and applied to primary school children. Additionally, we analyzed whether its content was about usability evaluation guidelines, or usability in general, to decide whether it should be included as a relevant study in the context of the systematic review.

Articles matching any of the following criteria were excluded from the review: (1) the study does not apply to mobile video games but to computer or in-person games, (2) the study is not about usability, and (3) the study is not written in English.

4.2 Data Collection

After applying the procedure to the databases, 910 results were found, from which 36 studies were selected for the review process. The obtained studies were filtered based on our inclusion and exclusion criteria. Table 2 summarizes the number of studies found in the search process, and Table 3 lists the selected studies.

5 Data Analysis and Results

In order to determine the studies relevant to the present work, we identified those whose main topic is educational video games and divided them into subtopics relevant to our analysis. The selected results are presented in Table 4.

5.1 Usability Evaluation Methods

Based on the articles obtained in the systematic review, we observed that the most common methods for the usability evaluation of mobile applications are heuristics, metrics, and questionnaires. Most of the usability evaluations were based on a specific application, with an emphasis on the user interface.

One study reports evaluation models aimed specifically at the playability aspect, such as the SEEM model and the entertainment-factor evaluation metrics proposed by Read, Macfarlane and Casey [ 14 ].

The USERBILITY model evaluates the user experience (UX) and usability of mobile applications in general, using generic heuristics based on Nielsen’s model. Although this model was designed for mobile applications, it does not take into account the distinctive characteristics of the mobile environment. For the evaluation of user experience, Userbility uses the 3E model (Expressions, Emotions and Experiences), which is also a generic model for evaluating user experience and is not designed specifically for mobile applications [ 15 ].

HECE is a model for evaluating the dimensions of playability for children. It uses Nielsen’s model as a basis and adds usability aspects for children, also assessing aspects such as children’s learning ability and whether the game is appropriate for them [ 16 ]. The authors apply this model in the development of usability evaluations for m-GBL in primary schools.

Another study established Graphical User Interface (GUI) guidelines for the design and development of mobile application prototypes aimed at children with hearing impairment. The authors also applied a usability evaluation using the inspection method, taking into account three user profiles: specialists in children with hearing disabilities, designers, and developers. They consider aspects of identity, design, and accessibility in the development of the GUI design guide for mobile applications aimed at children with hearing disabilities [ 17 ].

The MEEGA model is intended for case studies that begin with the educational video game as the treatment; after playing the game, the students answer the MEEGA questionnaire in order to collect the respective data [ 18 ].

Usability testing, following Nielsen’s model, consists of evaluating the aspects of learnability, efficiency, memorability, errors, and satisfaction [ 19 ].

5.2 Aspects Considered in the Usability Evaluation of Mobile Educational Video Games

We found that, in addition to the five main aspects of traditional usability evaluation [ 6 ], aspects such as visibility, game logic, playability, simplicity, and learning capacity are also taken into consideration. Table 5 shows, for each aspect relevant to our domain, the number of studies found in this research.

Below are the studies that substantiate the importance of the aspects selected as relevant:

Since these are gaming applications, playability is a fundamental characteristic for the usability evaluation. It plays a very important role in children’s learning, since their natural way of learning is through experience [ 14 ].

Mozelius indicates a set of key factors for the design of mobile educational games: simplicity, mobility, usability, playability, gradual increase of game levels, practical and conceptual understanding, collaboration, and competition, among others [ 20 ].

From Padilla’s point of view, for an educational video game to be really effective one must take into consideration playability, which makes the game attractive to users, and learning capacity, which allows users to obtain an educational benefit from it [ 3 ].

Maqsood, Mekhail and Chiasson take into consideration aspects such as the length of the game content, the relevance of the themes, the visual design, and the learning capacity, all important for an educational game. They mention that the group of children who participated in their case study preferred designs with characters that look older, because they felt such characters could teach them about situations they may encounter in the future. In addition, the colors used in the game also play an important role, since they can influence the players’ perceptions [ 21 ].

Cruz applied a questionnaire to evaluate the usability of a mobile video game for practicing the reasoning behind arithmetic operations. The results show that aspects such as the visibility of the system state and the consistency of elements such as buttons are not very relevant for the players, whereas aspects such as multiplayer competitiveness and sound effects are important factors, since they motivate the player to continue playing and thus generate satisfaction. Another highlight of the game is that, besides the logical reasoning component, it offers strategy hints so that players can reach easier winning situations, which relates to the simplicity of the game [ 22 ].

In a study by Drosos, applied to a serious 3D game, the students who tried the game mentioned that it was very flat and lacked characteristics that would increase the pleasure of playing it, which again suggests that playability plays an important role in educational games. Nevertheless, the majority said they liked the 3D design of the game and that learning new concepts about El Greco, the central theme of the game, was a pleasant educational experience [ 23 ].

6 Conclusions and Future Works

Based on the information obtained from the systematic literature review, we present the state of the art of usability evaluation methods for mobile educational video games. In addition, we identify the impact that these mobile applications have on the lives of primary school children nowadays. Studies related to m-learning were found, either general ones or ones applied to a specific educational level; however, few studies link the importance of m-learning with game usability. The existing usability evaluation methods do not seem to cover all the aspects of a mobile educational game, and some studies that develop a new mobile educational game evaluate its usability by adapting general usability heuristics. The results indicate a need for new sets of usability evaluation guidelines for mobile educational games, especially guidelines focused on primary school children, since they are the main target audience of this kind of application.

Bevan, N., Carter, J., Harker, S.: ISO 9241-11 revised: what have we learnt about usability since 1998? In: Kurosu, M. (ed.) HCI 2015. LNCS, vol. 9169, pp. 143–151. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20901-2_13


Valdez-Velazquez, L.L., Gomez-Sandoval, Z.: A usability study of educational molecular visualization on smart phones (2014)


Padilla Zea, N.: Metodología para el diseño de videojuegos educativos sobre una arquitectura para el análisis del aprendizaje colaborativo (2011)

González Sánchez, J.L.: Caracterización de la experiencia del jugador en video juegos. Editorial de la Universidad de Granada (2010)

ISO/IEC 9126-1: ISO/IEC 9126-1:2001 - Software engineering – Product quality – Part 1: Quality model (2001). https://www.iso.org/standard/22749.html . Accessed 05 Oct 2018

Nielsen, J.: Usability 101: Introduction to Usability (2012). https://www.nngroup.com/articles/usability-101-introduction-to-usability/ . Accessed 30 Sept 2018

Fernandez, A., Insfran, E., Abrahão, S.: Usability evaluation methods for the web: a systematic mapping study. Inf. Softw. Technol. 53 (8), 789–817 (2011)


Insfran, E., Fernandez, A.: A systematic review of usability evaluation in web development. In: Hartmann, S., Zhou, X., Kirchberg, M. (eds.) WISE 2008. LNCS, vol. 5176, pp. 81–91. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85200-1_10

Paz, F., Pow-Sang, J.A.: Usability evaluation methods for software development: a systematic mapping review. In: Proceedings of 8th International Conference on Advances Software Engineering and Its Applications, ASEA 2015, vol. 10, no. 1, pp. 1–4 (2016)

Pho, A., Dinscore, A.: Game-Based Learning (2015)

del Moral Pérez, M.E., Guzmán-Duque, A.P., Fernández, L.C.: Proyecto game to learn: aprendizaje basado en juegos para potenciar las inteligencias lógico-matemática, naturalista y lingüística en educación primaria. Pixel-Bit. Rev. Medios y Educ., no. 49 (2016)

Kitchenham, B.: Procedures for performing systematic reviews. Keele UK Keele Univ. 33 , 1–26 (2004)

Kitchenham, B., Charters, S.: Guidelines for performing systematic literature reviews in software engineering version. Engineering 45 (4ve), 1051 (2007)

Al Fatta, H., Maksom, Z., Zakaria, M.H.: Systematic literature review on usability evaluation model of educational games: playability, pedagogy, and mobility aspects. J. Theor. Appl. Inf. Technol. 31 (14) (2018)

Nascimento, I., Silva, W., Gadelha, B., Conte, T.: Userbility: a technique for the evaluation of user experience and usability on mobile applications. In: Kurosu, M. (ed.) HCI 2016. LNCS, vol. 9731, pp. 372–383. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-39510-4_35

Alsumait, A., Al-Osaimi, A.: Usability heuristics evaluation for child e-learning applications. In: Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services – iiWAS 2009, p. 425 (2009)

Muñoz, L.J.E., et al.: Graphical user interface design guide for mobile applications aimed at deaf children. In: Zaphiris, P., Ioannou, A. (eds.) LCT 2018. LNCS, vol. 10924, pp. 58–72. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91743-6_4

Petri, G., Gresse von Wangenheim, C., Ferreti Borgatto, A.: A large-scale evaluation of a model for the evaluation of games for teaching software engineering. In: 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering Education and Training Track (ICSE-SEET), pp. 180–189 (2017)

Adnan, F., Prasetyo, B., Nuriman, N.: Usability testing analysis on the Bana game as education game design references on junior high school. J. Pendidik. IPA Indones. 6 (1) (2017)

Mozelius, P., Torberg, D., Castillo, C.C.: An Educational Game for Mobile Learning-Some Essential Design Factors (2015). books.google.com

Maqsood, S., Mekhail, C., Chiasson, S.: A day in the life of JOS. In: Proceedings of the 17th ACM Conference on Interaction Design and Children - IDC 2018, pp. 241–252 (2018)

Cruz, B., Marchesini, P., Gatto, G., Souza-Concilio, I.: A mobile game to practice arithmetic operations reasoning. In: 2018 IEEE Global Engineering Education Conference (EDUCON), pp. 2003–2008 (2018)

Drosos, V., Alexandri, A., Tsolis, D., Alexakos, C.: A 3D serious game for cultural education. In 2017 8th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–5 (2017)


Author information

Authors and Affiliations

Pontificia Universidad Católica del Perú, San Miguel, Lima 32, Lima, Peru

Xiao Wen Lin Gao, Braulio Murillo & Freddy Paz


Corresponding author

Correspondence to Xiao Wen Lin Gao .

Editor information

Editors and Affiliations

Aaron Marcus and Associates, Berkeley, CA, USA

Aaron Marcus

Zuoyebang, K12 education, Beijing, China

Wentao Wang


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Lin Gao, X.W., Murillo, B., Paz, F. (2019). A Systematic Literature Review of Usability Evaluation Guidelines on Mobile Educational Games for Primary School Students. In: Marcus, A., Wang, W. (eds) Design, User Experience, and Usability. Practice and Case Studies. HCII 2019. Lecture Notes in Computer Science, vol 11586. Springer, Cham. https://doi.org/10.1007/978-3-030-23535-2_13

Download citation

DOI : https://doi.org/10.1007/978-3-030-23535-2_13

Publisher Name : Springer, Cham

Print ISBN : 978-3-030-23534-5

Online ISBN : 978-3-030-23535-2

eBook Packages : Computer Science Computer Science (R0)


  • Open access
  • Published: 07 May 2013

Usability of mobile applications: literature review and rationale for a new usability model

  • Rachel Harrison,
  • Derek Flood &
  • David Duce

Journal of Interaction Science volume  1 , Article number:  1 ( 2013 ) Cite this article

192k Accesses

347 Citations

3 Altmetric


The usefulness of mobile devices has increased greatly in recent years, allowing users to perform more tasks in a mobile context. This increase in usefulness has come at the expense of the usability of these devices in some contexts. We conducted a small review of mobile usability models and found that usability is usually measured in terms of three attributes: effectiveness, efficiency and satisfaction. Other attributes, such as cognitive load, tend to be overlooked in the most prominent usability models despite their likely impact on the success or failure of an application. To remedy this we introduce the PACMAD (People At the Centre of Mobile Application Development) usability model, which was designed to address the limitations of existing usability models when applied to mobile devices. PACMAD brings together significant attributes from different usability models in order to create a more comprehensive model. None of the attributes that it includes are new, but the existing prominent usability models ignore one or more of them, which could lead to an incomplete usability evaluation. We performed a literature search to compile a collection of studies that evaluate mobile applications and then evaluated the studies using our model.

Introduction

Advances in mobile technology have enabled a wide range of applications to be developed that can be used by people on the move. Developers sometimes overlook the fact that users will want to interact with such devices while on the move. Small screen sizes, limited connectivity, high power consumption rates and limited input modalities are just some of the issues that arise when designing for small, portable devices. One of the biggest issues is the context in which they are used. As these devices are designed to enable users to use them while mobile, the impact that the use of these devices has on the mobility of the user is a critical factor to the success or failure of the application.

Current research has demonstrated that cognitive overload can be an important aspect of usability [ 1 , 2 ]. It seems likely that mobile devices may be particularly sensitive to the effects of cognitive overload, due to their likely deployment in multiple task settings and limitations of size. This aspect of usability is often overlooked in existing usability models, which are outlined in the next section, as these models are designed for applications which are seldom used in a mobile context. Our PACMAD usability model for mobile applications, which we then introduce, incorporates cognitive load as this attribute directly impacts and may be impacted by the usability of an application.

A literature review, outlined in the following section, was conducted as validation of the PACMAD model. This literature review examined which attributes of usability, as defined in the PACMAD usability model, were used during the evaluation of mobile applications presented in a range of papers published between 2008 and 2010. Previous work by Kjeldskov & Graham [ 3 ] has looked at the research methods used in mobile HCI, but did not examine the particular attributes of usability incorporated in the PACMAD model. We also present the results of the literature review.

The impact of this work on future usability studies and what lessons other researchers should consider when performing usability evaluations on mobile applications are also discussed.

Background and literature review

Existing models of usability

Nielsen [ 4 ] identified five attributes of usability:

  Learnability : The system should be easy to learn so that the user can rapidly start getting work done with the system;

  Efficiency : The system should be efficient to use, so that once the user has learned the system, a high level of productivity is possible;

  Memorability : The system should be easy to remember so that the casual user is able to return to the system after some period of not having used it without having to learn everything all over again;

  Errors : The system should have a low error rate, so that users make few errors during the use of the system and that if they do make errors they can easily recover from them. Further, catastrophic errors must not occur;

  Satisfaction : The system should be pleasant to use, so that users are subjectively satisfied when using it.

In addition to this Nielsen defines Utility as the ability of a system to meet the needs of the user. He does not consider this to be part of usability but a separate attribute of a system. If a product fails to provide utility then it does not offer the features and functions required; the usability of the product becomes superfluous as it will not allow the user to achieve their goals. Likewise, the International Organization for Standardization (ISO) defined usability as the “Extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use” [ 5 ]. This definition identifies 3 factors that should be considered when evaluating usability.

  User : Person who interacts with the product;

  Goal : Intended outcome;

  Context of use : Users, tasks, equipment (hardware, software and materials), and the physical and social environments in which a product is used.

Each of the above factors may have an impact on the overall design of the product and in particular will affect how the user will interact with the system. In order to measure how usable a system is, the ISO standard outlines three measurable attributes:

  Effectiveness : Accuracy and completeness with which users achieve specified goals;

  Efficiency : Resources expended in relation to the accuracy and completeness with which users achieve goals;

  Satisfaction : Freedom from discomfort, and positive attitudes towards the use of the product.

Unlike Nielsen’s model of usability, the ISO standard does not consider Learnability, Memorability and Errors to be attributes of a product’s usability, although it could be argued that they are included implicitly within the definitions of Effectiveness, Efficiency and Satisfaction. For example, error rates can be argued to have a direct effect on efficiency.

Limitations for mobile applications

The models presented above were largely derived from traditional desktop applications. For example, Nielsen’s work was largely based on the design of telecoms systems, rather than computer software. The advent of mobile devices has presented new usability challenges that are difficult to model using traditional models of usability. Zhang and Adipat [ 6 ] highlighted a number of issues that have been introduced by the advent of mobile devices:

  Mobile Context : When using mobile applications the user is not tied to a single location. They may also be interacting with nearby people, objects and environmental elements which may distract their attention.

  Connectivity : Connectivity is often slow and unreliable on mobile devices. This will impact the performance of mobile applications that utilize these features.

  Small Screen Size : In order to provide portability mobile devices contain very limited screen size and so the amount of information that can be displayed is limited.

  Different Display Resolution : The resolution of mobile devices is reduced from that of desktop computers resulting in lower quality images.

  Limited Processing Capability and Power : In order to provide portability, mobile devices often contain less processing capability and power. This will limit the type of applications that are suitable for mobile devices.

  Data Entry Methods : The input methods available for mobile devices are different from those for desktop computers and require a certain level of proficiency. This problem increases the likelihood of erroneous input and decreases the rate of data entry.

From our review it is apparent that many existing models for usability do not consider mobility and its consequences, such as additional cognitive load. This complicates the job of the usability practitioner, who must consequently define their task model to explicitly include mobility. One might argue that the lack of reference to a particular context could be a strength of a usability model provided that the usability practitioner has the initiative and knows how to modify the model for a particular context.

The PACMAD usability model aims to address some of the shortcomings of existing usability models when applied to mobile applications. This model builds on existing theories of usability but is tailored specifically for applications that can be used on mobile devices. The PACMAD usability model is depicted in Figure  1 side by side with Nielsen’s and the ISO’s definition of usability. The PACMAD usability model incorporates the attributes of both the ISO standard and Nielsen’s model and also introduces the attribute of cognitive load which is of particular importance to mobile applications. The following section introduces the PACMAD usability model and describes in detail each of the attributes of usability mentioned below as well as the three usability factors that are part of this model: user, task and context.

Figure 1. Comparison of usability models.

The PACMAD usability model for mobile applications identifies three factors (User, Task and Context of use) that should be considered when designing mobile applications that are usable. Each of these factors will impact the final design of the interface for the mobile application. In addition to this the model also identifies seven attributes that can be used to define metrics to measure the usability of an application. The following section outlines each of these factors and attributes in more detail.

Factors of usability

The PACMAD usability model identifies three factors which can affect the overall usability of a mobile application: User , Task and Context of use . Existing usability models such as those proposed by the ISO [ 5 ] and Nielsen [ 4 ] also recognise these factors as being critical to the successful usability of an application. For mobile applications Context of use plays a critical role as an application may be used in multiple, very different contexts.

User It is important to consider the end user of an application during the development process. As mobile applications are usually designed to be small, the traditional input methods, such as a keyboard and mouse, are no longer practical. It is therefore necessary for application designers to look at alternative input methods. Some users may find it difficult to use some of these methods due to physical limitations. For example it has been shown [ 7 ] that some Tetraplegic users who have limited mobility in their upper extremities tend to have high error rates when using touch screens and this may cause unacceptable difficulties with certain (usually small) size targets.

Another factor that should be considered is the user’s previous experience. If a user is an expert at the chosen task then they are likely to favour shortcut keys to accomplish this task. On the other hand novice users may prefer an interface that is intuitive and easy to navigate and which allows them to discover what they need. This trade-off must be considered during the design of the application.

Task The word task refers here to the goal the user is trying to accomplish with the mobile application. During the development of applications, additional features can be added to an application in order to allow the user to accomplish more with the software. This extra functionality comes at the expense of usability as these additional features increase the complexity of the software and therefore the user’s original goal can become difficult to accomplish.

For example, consider a digital camera. If a user wants to take a photograph, they must first select between different modes (e.g. video, stills, action, playback, etc.) and then begin to line up the shot. This problem is further compounded if the user needs to take a photograph at night and needs to search through a number of menu items to locate and turn on a flashlight.

Context of use The word context refers here to the environment in which the user will use the application. We want to be able to view context separately from both the user and the task. Context not only refers to a physical location but also includes other features such as the user’s interaction with other people or objects (e.g. a motor vehicle) and other tasks the user may be trying to accomplish. Research has shown that using mobile applications while walking can slow down the walker’s average walking speed [ 8 ]. As mobile applications can be used while performing other tasks it is important to consider the impact of using the mobile application in the appropriate context.

Attributes of usability

The PACMAD usability model identifies 7 attributes which reflect the usability of an application: Effectiveness , Efficiency , Satisfaction , Learnability , Memorability , Errors and Cognitive load . Each of these attributes has an impact on the overall usability of the application and as such can be used to help assess the usability of the application.

Effectiveness Effectiveness is the ability of a user to complete a task in a specified context. Typically effectiveness is measured by evaluating whether or not participants can complete a set of specified tasks.

Efficiency Efficiency is the ability of the user to complete their task with speed and accuracy. This attribute reflects the productivity of a user while using the application. Efficiency can be measured in a number of ways, such as the time to complete a given task, or the number of keystrokes required to complete a given task.
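As a concrete illustration of how these attributes can be turned into metrics, the sketch below computes a task completion rate (effectiveness) and mean time on task (efficiency) from hypothetical usability-test sessions. The session data and the choice to count only successful attempts toward time on task are assumptions made for the example, not part of the PACMAD model itself.

```python
# Hypothetical usability-test sessions: whether the task was completed
# and how long the attempt took (in seconds).
sessions = [
    {"completed": True,  "time_s": 42.0},
    {"completed": True,  "time_s": 55.5},
    {"completed": False, "time_s": 90.0},
    {"completed": True,  "time_s": 38.2},
]

# Effectiveness: proportion of participants who completed the task.
completion_rate = sum(s["completed"] for s in sessions) / len(sessions)

# Efficiency: mean time on task, counting successful attempts only.
successful_times = [s["time_s"] for s in sessions if s["completed"]]
mean_time_on_task = sum(successful_times) / len(successful_times)

print(f"Effectiveness (completion rate): {completion_rate:.0%}")
print(f"Efficiency (mean time on task):  {mean_time_on_task:.1f} s")
```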

Satisfaction Satisfaction is the perceived level of comfort and pleasantness afforded to the user through the use of the software. This is reflected in the attitudes of the user towards the software. This is usually measured subjectively and varies between individual users. Questionnaires and other qualitative techniques are typically used to measure a user’s attitudes towards a software application.

Learnability A recent survey of mobile application users [ 9 ] found that users will spend on average 5 minutes or less learning to use a mobile application. There are a large number of applications available on mobile platforms and so if users are unable to use an application they may simply select a different one. For this reason the PACMAD model includes the attribute Learnability as suggested by Nielsen.

Learnability is the ease with which a user can gain proficiency with an application. It typically reflects how long it takes a person to be able to use the application effectively. In order to measure Learnability, researchers may look at the performance of participants during a series of tasks, and measure how long it takes these participants to reach a pre-specified level of proficiency.

Memorability The survey also found that mobile applications are used on an infrequent basis and that participants used almost 50% of the applications only once a month [ 9 ]. Thus there may be a large period of inactivity between uses and so participants may not easily recall how to use the application. Consequently the PACMAD usability model includes the attribute of Memorability as also suggested by Nielsen.

Memorability is the ability of a user to retain how to use an application effectively. Software might not be used on a regular basis and sometimes may only be used sporadically. It is therefore necessary for users to remember how to use the software without the need to relearn it after a period of inactivity. Memorability can be measured by asking participants to perform a series of tasks after having become proficient with the use of the software and then asking them to perform similar tasks after a period of inactivity. A comparison can then be made between the two sets of results to determine how memorable the application was.

Errors The PACMAD usability model extends the description of Errors, first proposed by Nielsen, to include an evaluation of the errors that are made by participants while using mobile apps. This allows developers to identify the most troublesome areas for users and to improve these areas in subsequent iterations of development. This attribute is used to reflect how well the user can complete the desired tasks without errors. Nielsen [ 4 ] states that users should make few errors during the use of a system and that if they do make errors they should be able to easily recover from them. The error rate of users may be used to infer the simplicity of a system. The PACMAD usability model considers the nature of errors as well as the frequency with which they occur. By understanding the nature of these errors it is possible to prevent these errors from occurring in future versions of the application.

Cognitive load The main contribution of the PACMAD model is its inclusion of Cognitive Load as an attribute of usability. Unlike traditional desktop applications, users of mobile applications may be performing additional tasks, such as walking, while using the mobile device. For this reason it is important to consider the impact that using the mobile device will have on the performance of the user of these additional tasks. For example a user may wish to send a text message while walking. In this case the user’s walking speed will be reduced as they are concentrating on sending the message which is distracting them from walking.

Cognitive load refers to the amount of cognitive processing required by the user to use the application. In traditional usability studies a common assumption is that the user is performing only a single task and can therefore concentrate completely on that task. In a mobile context users will often be performing a second action in addition to using the mobile application [ 8 , 10 ]. For example a user may be using a stereo while simultaneously driving a car. In this scenario it is important that the cognitive load required by the mobile application, in this case the stereo, does not adversely impact the primary task.

While the user is using the application in a mobile context it will impact both the user’s ability to move and to operate the mobile application. Therefore it is important to consider both dimensions when studying the usability of mobile applications. One way this can be measured is through the NASA Task Load Index (TLX) [ 11 ]. This is a subjective workload assessment tool for measuring the cognitive workload placed on a user by the use of a system. In this paper we adopt a relatively simple view of cognitive load. For a more accurate assessment it may be preferable to adopt a more powerful multi-factorial approach [ 1 , 12 ] but this is beyond the scope of this paper.

Literature review

In order to evaluate the appropriateness and timeliness of the PACMAD usability model for mobile applications, a literature review was conducted to review current approaches and to determine the need for a comprehensive model that includes cognitive load. We focused on papers published between 2008 and 2010 which included an evaluation of the usability of a mobile application.

Performing the literature review

The first step in the literature review was to collect all of the publications from the identified sources. These sources were identified by searching the ACM digital library, IEEE digital library and Google Scholar. The search strings used during these searches were “ Mobile Application Evaluations ”, “ Usability of mobile applications ” and “ Mobile application usability evaluations ”. The following conferences and journals were identified as being the most relevant sources: the Mobile HCI conference (MobileHCI), the International Journal of Mobile Human Computer Interaction (IJMHCI), the ACM Transactions on Computer-Human Interaction (TOCHI), the International Journal of Human Computer Studies (IJHCS), the Personal and Ubiquitous Computing journal (PUC), and the International Journal of Human-Computer Interaction (IJHCI). We also considered the ACM Conference on Human Factors in Computing Systems (CHI) and the IEEE Transactions on Mobile Computing (IEEE TOMC). These sources were later discarded as very few papers (less than 5% of the total) were relevant.

The literature review was limited to the publications between the years 2008 and 2010 due to the emergence of smart phones during this time. Table  1 shows the number of publications that were examined from each source.

The sources presented above included a number of different types of publications (Full papers, short papers, doctoral consortium, editorials, etc.). We focused the study only on full or short research papers from peer reviewed sources. This approach was also adopted by Budgen et al. [ 13 ]. Table  2 shows the number of remaining publications by source.

The abstract of each of the remaining papers was examined to determine if the paper:

Conducted an evaluation of a mobile application/device;

Contained some software component with which the users interact;

Conducted an evaluation focused on the interaction with the application or device.

Publications that did not meet the above criteria were removed.

The following criteria were used to exclude papers:

Focused only on application development methodologies and techniques;

Contained only physical interaction without a software component;

Examined only social aspects of using mobile applications;

Did not consider mobile applications.

Each abstract was reviewed by the first two authors to determine if it should be included within the literature review. When a disagreement arose between the reviewers it was discussed until mutual agreement was reached. A small number of relevant publications were unavailable to the authors. Table  3 shows the number of papers included within the literature review by source.

Each of the remaining papers was examined in detail by one reviewer (either the first or second author of this paper), who identified for each paper:

The attribute of usability that could be measured through the collected metrics;

The focus of the research presented;

The type of study conducted.

To ensure the quality of the data extraction, the first and second authors independently reviewed a 10% sample of the papers and compared their results. When a disagreement arose it was discussed until agreement was reached.
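The paper reports that disagreements over this sample were resolved by discussion rather than with an agreement statistic. For illustration only, the sketch below shows one common way such agreement could be quantified, using Cohen's kappa on hypothetical inclusion/exclusion decisions; neither the statistic nor the data come from the reviewed study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (nominal labels)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: proportion of items both raters labelled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical screening decisions for a 10% sample of papers.
reviewer_1 = ["include", "exclude", "include", "include", "exclude", "include"]
reviewer_2 = ["include", "exclude", "exclude", "include", "exclude", "include"]
print(f"Cohen's kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```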

Twenty papers that were identified as relevant did not contain any formal evaluations of the proposed technologies; the results presented below exclude these 20 papers. In addition, some papers presented multiple studies. In these cases each study was considered independently, so the results are based on the number of studies within the evaluated papers rather than the number of papers.

Limitations

This literature review is limited in a number of ways. Firstly, a small number of papers were unavailable to the researchers (8 of the 139 papers considered relevant). This unavailability of less than 6% of the papers is unlikely to have a large impact on the results presented. Secondly, by omitting certain sources from the study a bias may have been introduced. We felt that the range of sources considered was a fair representation of the field of usability of mobile applications, although some outlying studies may have been omitted due to limited resources; our reviews of these sources led us to believe that the omitted papers were of borderline significance. Ethical approval for this research was given by the Oxford Brookes University Research Ethics Committee.

Research questions

To evaluate the PACMAD usability model, three research questions (RQ1 to RQ3) were established to determine how important each of the factors and attributes of usability is in the context of mobile applications.

RQ1: What attributes are used when considering the usability of mobile applications?

This research question was established to discover what attributes are typically used to analyse mobile applications and which metrics are associated with them. The answers to this question provide evidence and data for the PACMAD usability model.

RQ2: To what extent are the factors of usability considered in existing research?

In order to determine how research in mobile applications is evolving, RQ2 was established to examine the current research trends into mobile applications, with a particular focus on the factors that affect usability.

In addition to this we wanted to establish which research methods are most commonly used when evaluating mobile applications. For this reason, a third research question was established.

RQ3: What research methodologies are used to evaluate the usability of mobile applications?

There are many ways in which mobile applications can be evaluated, including controlled studies, field studies, ethnography, experiments, case studies and surveys. This research question aims to identify the most common research methodologies used to evaluate mobile applications. The answers will shed light on the maturity of the mobile application engineering field.

The above research questions were answered by examining the literature on mobile applications. The range of literature on this topic is so broad that it was important to restrict the review to the most relevant and recent publications, which is why the publication interval was limited to papers published between 2008 and 2010.

Table 4 shows the percentage of studies that include metrics, such as time to complete a given task, which directly or indirectly assess the attributes of usability included in the PACMAD usability model. Some studies evaluated multiple attributes of usability, and therefore the results present both the percentage and the number of studies in which each attribute was considered. These studies often do not explicitly mention usability or any usability-related criteria, and so the metrics used in the papers' analyses were used to infer which usability attributes were considered. This lack of precision is probably due to the lack of agreement on what constitutes usability and to the fact that the attributes are not orthogonal. The three most common attributes, Effectiveness, Efficiency and Satisfaction, correspond to the attributes identified by the ISO standard for usability.

One of the reasons these attributes are so widely considered is their direct relationship to the technical capabilities of the system. Both Effectiveness and Efficiency are related to the design and implementation of the system and so are usually tested thoroughly. These attributes are also relatively easy to measure: in most cases the Effectiveness of the system is evaluated by monitoring whether a user can accomplish a pre-specified task, while Efficiency can be measured as the time taken by the participant to complete this task. Questionnaires and structured interviews can be used to determine users' Satisfaction with the system. Approximately 22% of the papers reviewed evaluated all three of these attributes.
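To illustrate how these three attributes are typically operationalised in such studies, the sketch below computes a task completion rate (Effectiveness), the mean time on task for successful attempts (Efficiency) and a mean questionnaire rating (Satisfaction) from invented per-participant data; it is a generic example, not drawn from any of the reviewed papers.

```python
from statistics import mean

# Hypothetical per-participant results for one pre-specified task: whether the
# task was completed, the time taken (seconds) and a satisfaction rating (1-5).
results = [
    {"completed": True,  "time_s": 42.0, "satisfaction": 4},
    {"completed": True,  "time_s": 55.5, "satisfaction": 3},
    {"completed": False, "time_s": 90.0, "satisfaction": 2},
    {"completed": True,  "time_s": 38.2, "satisfaction": 5},
]

# Effectiveness: proportion of participants who completed the task.
effectiveness = sum(r["completed"] for r in results) / len(results)

# Efficiency: mean completion time among successful attempts only.
efficiency = mean(r["time_s"] for r in results if r["completed"])

# Satisfaction: mean questionnaire rating across all participants.
satisfaction = mean(r["satisfaction"] for r in results)

print(f"Effectiveness: {effectiveness:.0%}")
print(f"Mean time on task (successes): {efficiency:.1f} s")
print(f"Mean satisfaction: {satisfaction:.1f} / 5")
```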

The focus on these attributes implies that Learnability, Memorability, Errors and Cognitive load are considered to be of less importance than Effectiveness, Efficiency and Satisfaction. Learnability, Memorability, Errors and Cognitive load are not easy to evaluate, which may be why their assessment is often overlooked. As technology has matured, designers have begun to consider usability earlier in the design process; this is reflected to a certain extent by the technological shift away from command-line interfaces towards GUI-based interfaces.

The aspects of usability considered least often in the papers reviewed are Learnability and Memorability. There are several reasons for this. The nature of these attributes demands that they are evaluated over a period of time: to measure Learnability effectively, users' progress needs to be checked at regular intervals or tracked over many completions of a task. In the papers reviewed, Learnability was usually measured indirectly through changes in effectiveness or efficiency over repeated completions of a specified task.
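The sketch below illustrates this indirect approach: it takes hypothetical completion times over five repeated trials and reports the improvement in mean completion time from the first to the last trial as a rough indicator of Learnability. Both the data and the choice of summary statistic are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical completion times (seconds) per participant over five repeated
# completions of the same task.
times = {
    "P1": [80, 64, 55, 50, 47],
    "P2": [95, 70, 66, 60, 58],
    "P3": [72, 60, 52, 49, 46],
}

# Mean completion time per trial, across participants.
trial_means = [mean(t[i] for t in times.values()) for i in range(5)]

# Relative improvement from the first to the last trial as a crude
# learnability indicator (a larger improvement suggests easier learning).
improvement = (trial_means[0] - trial_means[-1]) / trial_means[0]

print("Mean time per trial:", [round(m, 1) for m in trial_means])
print(f"Relative improvement, trial 1 to trial 5: {improvement:.0%}")
```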

Memorability was only measured subjectively in the papers reviewed. One way to objectively measure Memorability is to examine participants’ use of the system after a period of inactivity with the system. The practical problem of recruiting participants who are willing to return multiple times to participate in an evaluation is probably one of the reasons why this attribute is not often measured objectively.

What differentiates mobile applications from more traditional applications is the ability to use the application while moving. In this context the user's attention is divided between the act of moving and the use of the application. About 26% of the studies considered cognitive load. Some of these studies used the change in the user's performance of the primary task (usually walking or driving) as an indication of cognitive load; others used the NASA TLX [11] to measure cognitive load subjectively.
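As an illustration of the first approach, the sketch below computes a simple dual-task cost: the relative drop in primary-task performance (here, walking speed) when the mobile application is used at the same time. The data and the particular cost measure are illustrative assumptions, not results from the reviewed studies.

```python
# Hypothetical walking speeds (m/s) measured alone (baseline) and while using
# the mobile application (dual task); the relative drop is taken as an
# indirect indication of the cognitive load imposed by the application.
baseline  = {"P1": 1.40, "P2": 1.35, "P3": 1.50}
dual_task = {"P1": 1.10, "P2": 1.20, "P3": 1.05}

for participant in baseline:
    cost = (baseline[participant] - dual_task[participant]) / baseline[participant]
    print(f"{participant}: dual-task cost = {cost:.0%}")
```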

Table 5 shows the current research trends within mobile application research. The majority of work is focused on tasks: approximately 47% of the papers reviewed focus on allowing users to complete a specific task. The range of tasks considered is too broad to describe in detail, so we present only some of the most dominant trends seen in the literature review.

The integration of cameras into mobile devices has enabled the emergence of a new class of mobile application known as augmented reality. For example, Bruns and Bimber [14] have developed an augmented reality application that lets users take a photograph of an exhibit at an art gallery, which the system then uses to retrieve additional information about the work of art. Similar systems have also been developed for tourist Points of Interest (POIs) [15].

While maps are the traditional way of navigating to a destination, mobile devices incorporating GPS (Global Positioning System) technology have enabled researchers to investigate new ways of helping users to navigate. A number of systems [16, 17] have proposed the use of tactile feedback to guide users: through different vibration patterns the system tells users whether they should turn left, turn right or keep going straight. Another alternative is the use of sound; by altering the spatial balance and volume of a user's music, Jones et al. [18] have developed a system that guides users to their destination.

One of the biggest limitations of mobile devices is their restricted input modalities. Developers do not have a large amount of space for physical buttons, and therefore researchers are investigating other methods of interaction. This type of research accounts for approximately 29% of the studies reviewed.

The small screen size of mobile devices means that only a small fraction of a document can be seen in detail. When mobile devices are used for navigating between locations, this restriction can cause difficulty for users. To address this issue, Burigat et al. [19] developed a Zoomable User Interface with Overview (ZUIO). This interface allows a user to zoom into small sections of a document, such as a map, while displaying a small-scale overview of the entire document so that the user can see where they are in the overall document. This type of system can also be used with other large documents, such as web pages and images.

Audio interfaces [20] are being investigated to help drivers use in-car systems. Traditional interfaces present information visually, but for drivers this distraction has safety-critical implications, so audio input is common for in-vehicle systems. The low quality of voice recognition technology can, however, limit its effectiveness in this context; Weinberg et al. [21] have shown that multiple push-to-talk buttons can improve users' performance with such systems. Other interaction paradigms explored in these papers include touch screens [22], pressure-based input [23], spatial awareness [24] and gestures [25]. As well as these new input modalities, a number of researchers are also investigating alternative output modes such as sound [26] and tactile feedback [27].

In addition to considering specific tasks and input modalities, a small number of researchers are investigating ways to help specific types of users, such as those with physical or psychological disabilities, to complete common tasks. This type of research accounts for approximately 9% of the evaluated papers. Approximately 8% of the papers focus on the context in which mobile applications are used. The remaining 6% of studies are concerned with new development and evaluation methodologies for mobile applications, including rapid prototyping tools for in-car systems, the effectiveness of expert evaluations and the use of heuristics for evaluating mobile haptic interfaces.

RQ3 was posed to investigate how usability evaluations are currently conducted. The literature review revealed that 7 of the papers evaluated did not contain any usability evaluations. Some of the remaining papers included multiple studies to evaluate different aspects of a technology or were conducted at different times during the development process. Table  6 shows the percentage of studies that were conducted using each research methodology.

By far the most dominant research methodology in the examined studies was the controlled experiment, accounting for approximately 59% of the studies. In a controlled experiment, all variables are held constant except the independent variable, which is manipulated by the experimenter; the dependent variable is the metric measured by the experimenter. In this way a cause-and-effect relationship between the independent and dependent variables may be investigated. Causality can be inferred from the covariation of the independent and dependent variables, the temporal precedence of the cause (the manipulation of the independent variable) and the elimination of confounding factors through control and internal validity tests.
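As an illustration of how such an experiment is commonly analysed, the sketch below compares hypothetical task-completion times under two interface conditions using Welch's t-test from SciPy. The data are invented and the choice of test is an assumption for the example, not a procedure reported in the reviewed studies.

```python
from scipy import stats

# Hypothetical task-completion times (seconds): the interface is the
# independent variable manipulated by the experimenter, and completion time
# is the dependent variable that is measured.
condition_a = [41.2, 38.5, 45.0, 39.8, 44.1, 40.3]
condition_b = [48.9, 52.3, 47.5, 50.1, 55.0, 49.4]

# Welch's t-test (does not assume equal variances between the two groups).
t_stat, p_value = stats.ttest_ind(condition_a, condition_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```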

Although the most common approach is the use of controlled experiments, other research methodologies were also used. A number of studies evaluated the use of new technologies through field studies. Field studies are conducted in a real world context, enabling evaluators to determine how users would use a technology outside of a controlled setting. These studies often revealed issues that would not be seen in a controlled setting.

For example, Kristoffersen and Bratteberg [28] deployed a system to help travellers get to and from an airport by train without using paper tickets; the traveller's credit card served as the ticket for a journey to or from the airport. During the field study a number of usability issues were experienced by travellers. One user wanted to use a single card to buy tickets for himself and a companion; the system did not support this because the developers had assumed each user would have their own credit card and had therefore designed the system to issue each ticket on a different card.

The evaluation also revealed issues relating to how the developers had implemented the different journey types, i.e. to and from the airport. When travelling to the airport, users are required to swipe their credit card at the beginning and end of the journey, whereas when returning from the airport the user only needs to swipe their card when leaving the airport. One user discovered this only after he had swiped his card to terminate a journey from the airport and was instead charged for a second ticket to the airport.

Although controlled experiments and field studies account for almost 90% of the studies, other strategies were also used. Surveys were used to better understand how the public reacted to mobile systems; some of these studies were specific to a new technology or paradigm [29], while others considered uses such as working while on the move [30]. In two cases (1% of the studies) archival research was used to investigate a particular phenomenon relating to mobile technologies. Fehnert and Kosagowsky [31] used archival research to investigate the relationship between expert evaluations of the user experience quality of mobile phones and subsequent usage figures, and Lacroix et al. [32] used it to investigate the relationship between goal difficulty and performance within the context of an ongoing physical activity intervention program.
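As an illustration of the kind of relationship examined in such archival studies, the sketch below computes a Pearson correlation between hypothetical expert ratings and usage figures, loosely in the spirit of Fehnert and Kosagowsky [31]; the data and the choice of statistic are assumptions for illustration only.

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Hypothetical archival data: expert user-experience ratings for a set of
# phone models and their subsequent usage figures (arbitrary units).
expert_rating = [6.2, 7.8, 5.1, 8.4, 7.0]
usage_figures = [120, 340, 90, 410, 260]

print(f"Pearson r = {correlation(expert_rating, usage_figures):.2f}")
```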

In some cases no formal evaluation was conducted; instead the new technology presented in the paper was evaluated informally with colleagues of the developers. These evaluations typically involved a small number of participants and provide only anecdotal evidence of a system's usability.

The results obtained during the literature review reinforce the importance of cognitive load as an attribute of usability: almost 23% of the studies measured the cognitive load of the application under evaluation. This shows that researchers in the area of mobile applications are beginning to recognise the importance of cognitive load in this domain, and as such there is sufficient evidence for including it in the PACMAD model of usability.

The results also show that Memorability is not considered an important aspect of usability by many researchers: only 2% of the studies evaluated it. If an application is easy to learn, users may be willing to relearn how to use it, and Memorability may indeed not be significant. On the other hand, some applications have a steep learning curve and require a significant amount of time to learn; for these applications Memorability is an important attribute.

The trade-off between Learnability and Memorability is a consideration for application developers. Factors such as the task to be accomplished and the characteristics of the user should be considered when making this decision. The PACMAD model recommends that both factors be considered, although it also recognises that it may be adequate to evaluate only one of them depending on the application under evaluation. The literature review has also shown that the remaining attributes of usability are considered extensively by current research: Effectiveness, Efficiency and Satisfaction were each included in over 50% of the studies, and Errors were evaluated in over 30% of them.

When considering the factors that can affect usability, the task is the most dominant factor being researched. Over 45% of the papers examined focused primarily on allowing a user to accomplish a task; when interaction with an application is itself considered a task, this figure rises to approximately 75%. Context of use and the User were each considered in less than 10% of the papers. Context of use can vary enormously and so should be considered an important factor of usability [5, 33]; our results indicate that context is not extensively researched, which suggests a gap in the literature.

It is revealing that some components of the PACMAD model occur only infrequently in the literature. As mentioned above, Learnability and Memorability are rarely investigated, perhaps suggesting that researchers expect users to be able to learn to use apps without much difficulty. This finding could also be due to the difficulty of finding suitable subjects willing to take part in experiments on these attributes, or to the lack of standard research methods for them. Effectiveness, Efficiency, Satisfaction and Errors were investigated more frequently, possibly because these attributes are widely recognised as important and because research methods for investigating them are well understood and documented. Almost a quarter of the studies investigated discussed Cognitive load. It is surprising that this figure is not higher, although this could again be due to the lack of a well-defined research methodology for investigating this attribute.

Conclusions

The range and availability of mobile applications is expanding rapidly. With the increased processing power available on portable devices, developers are increasing the range of services that they provide. The small size of mobile devices has limited the ways in which users can interact with them. Issues such as the small screen size, poor connectivity and limited input modalities have an effect on the usability of mobile applications.

The prominent models of usability do not adequately capture the complexities of interacting with applications on a mobile platform. For this reason, this paper presents our PACMAD usability model which augments existing usability models within the context of mobile applications.

As a proof of concept for this model a literature review has been conducted. This review highlighted the extent to which the attributes of the PACMAD model are considered within the mobile application domain. It was found that each attribute was considered in at least 20% of the studies, with the exception of Memorability; one reason for this may be the difficulty associated with evaluating Memorability.

The literature review has also revealed a number of novel interaction methods that are being researched at present, such as spatial awareness and pressure based input. These techniques are in their infancy but with time and more research they may eventually be adopted.

Appendix A: Papers used in the literature review

Apitz, G., F. Guimbretière, and S. Zhai, Foundations for designing and evaluating user interfaces based on the crossing paradigm. ACM Trans. Comput.-Hum. Interact., 2008. 17(2): p. 1–42.

Arning, K. and M. Ziefle, Ask and You Will Receive: Training Novice Adults to use a PDA in an Active Learning Environment. International Journal of Mobile Human Computer Interaction (IJMHCI), 2010. 2(1): p. 21–47.

Arvanitis, T.N., et al., Human factors and qualitative pedagogical evaluation of a mobile augmented reality system for science education used by learners with physical disabilities. Personal Ubiquitous Comput., 2009. 13(3): p. 243–250.

Axtell, C., D. Hislop, and S. Whittaker, Mobile technologies in mobile spaces: Findings from the context of train travel. Int. J. Hum.-Comput. Stud., 2008. 66(12): p. 902–915.

Baber, C., et al., Mobile technology for crime scene examination. Int. J. Hum.-Comput. Stud., 2009. 67(5): p. 464–474.

Bardram, J.E., Activity-based computing for medical work in hospitals. ACM Trans. Comput.-Hum. Interact., 2009. 16(2): p. 1–36.

Bergman, J., J. Kauko, and J. Keränen, Hands on music: physical approach to interaction with digital music, in Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services. 2009, ACM: Bonn, Germany.

Bergman, J. and J. Vainio, Interacting with the flow, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Bertini, E., et al., Appropriating Heuristic Evaluation for Mobile Computing. International Journal of Mobile Human Computer Interaction (IJMHCI), 2009. 1(1): p. 20–41.

Böhmer, M. and G. Bauer, Exploiting the icon arrangement on mobile devices as information source for context-awareness, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Boström, F., et al., Capricorn - an intelligent user interface for mobile widgets, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Brewster, S.A. and M. Hughes, Pressure-based text entry for mobile devices, in Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services. 2009, ACM: Bonn, Germany.

Bruns, E. and O. Bimber, Adaptive training of video sets for image recognition on mobile phones. Personal Ubiquitous Comput., 2009. 13(2): p. 165–178.

Brush, A.J.B., et al., User experiences with activity-based navigation on mobile devices, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Burigat, S., L. Chittaro, and S. Gabrielli, Navigation techniques for small-screen devices: An evaluation on maps and web pages. Int. J. Hum.-Comput. Stud., 2008. 66(2): p. 78–97.

Büring, T., J. Gerken, and H. Reiterer, Zoom interaction design for pen-operated portable devices. Int. J. Hum.-Comput. Stud., 2008. 66(8): p. 605–627.

Buttussi, F., et al., Using mobile devices to support communication between emergency medical responders and deaf people, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Chen, N.Y., F. Guimbretière, and C.E. Löckenhoff, Relative role of merging and two-handed operation on command selection speed. Int. J. Hum.-Comput. Stud., 2008. 66(10): p. 729–740.

Chen, T., Y. Yesilada, and S. Harper, What input errors do you experience? Typing and pointing errors of mobile Web users. Int. J. Hum.-Comput. Stud., 2010. 68(3): p. 138–157.

Cherubini, M., et al., Text versus speech: a comparison of tagging input modalities for camera phones, in Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services. 2009, ACM: Bonn, Germany.

Chittaro, L. and A. Marassi, Supporting blind users in selecting from very long lists of items on mobile phones, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Chittaro, L. and D. Nadalutti, Presenting evacuation instructions on mobile devices by means of location-aware 3D virtual environments, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Clawson, J., et al., Mobiphos: a collocated-synchronous mobile photo sharing application, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Cockburn, A. and C. Gutwin, A model of novice and expert navigation performance in constrained-input interfaces. ACM Trans. Comput.-Hum. Interact., 2010. 17(3): p. 1–38.

Cox, A.L., et al., Tlk or txt? Using voice input for SMS composition. Personal Ubiquitous Comput., 2008. 12(8): p. 567–588.

Crossan, A., et al., Instrumented Usability Analysis for Mobile Devices. International Journal of Mobile Human Computer Interaction (IJMHCI), 2009. 1(1): p. 1–19.

Cui, Y., et al., Linked internet UI: a mobile user interface optimized for social networking, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Cummings, M.L., et al., Supporting intelligent and trustworthy maritime path planning decisions. Int. J. Hum.-Comput. Stud., 2010. 68(10): p. 616–626.

Dahl, Y. and D. Svanæs, A comparison of location and token-based interaction techniques for point-of-care access to medical information. Personal Ubiquitous Comput., 2008. 12(6): p. 459–478.

Dai, L., A. Sears, and R. Goldman, Shifting the focus from accuracy to recallability: A study of informal note-taking on mobile information technologies. ACM Trans. Comput.-Hum. Interact., 2009. 16(1): p. 1–46.

Decle, F. and M. Hachet, A study of direct versus planned 3D camera manipulation on touch-based mobile phones, in Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services. 2009, ACM: Bonn, Germany.

Duh, H.B.-L., V.H.H. Chen, and C.B. Tan, Playing different games on different phones: an empirical study on mobile gaming, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Dunlop, M.D. and M.M. Masters, Investigating five key predictive text entry with combined distance and keystroke modelling. Personal Ubiquitous Comput., 2008. 12(8): p. 589–598.

Ecker, R., et al., pieTouch: a direct touch gesture interface for interacting with in-vehicle information systems, in Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services. 2009, ACM: Bonn, Germany.

Eslambolchilar, P. and R. Murray-Smith, Control centric approach in designing scrolling and zooming user interfaces. Int. J. Hum.-Comput. Stud., 2008. 66(12): p. 838–856.

Fehnert, B. and A. Kosagowsky, Measuring user experience: complementing qualitative and quantitative assessment, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Fickas, S., M. Sohlberg, and P.-F. Hung, Route-following assistance for travelers with cognitive impairments: A comparison of four prompt modes. Int. J. Hum.-Comput. Stud., 2008. 66(12): p. 876–888.

Froehlich, P., et al., Exploring the design space of Smart Horizons, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Gellersen, H., et al., Supporting device discovery and spontaneous interaction with spatial references. Personal Ubiquitous Comput., 2009. 13(4): p. 255–264.

Ghiani, G., B. Leporini, and F. Paternò, Vibrotactile feedback as an orientation aid for blind users of mobile guides, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Gostner, R., E. Rukzio, and H. Gellersen, Usage of spatial information for selection of co-located devices, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Goussevskaia, O., M. Kuhn, and R. Wattenhofer, Exploring music collections on mobile devices, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Greaves, A. and E. Rukzio, Evaluation of picture browsing using a projector phone, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Hachet, M., et al., Navidget for 3D interaction: Camera positioning and further uses. Int. J. Hum.-Comput. Stud., 2009. 67(3): p. 225–236.

Hall, M., E. Hoggan, and S. Brewster, T-Bars: towards tactile user interfaces for mobile touchscreens, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Hang, A., E. Rukzio, and A. Greaves, Projector phone: a study of using mobile phones with integrated projector for interaction with maps, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Hardy, R., et al., Mobile interaction with static and dynamic NFC-based displays, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Heikkinen, J., T. Olsson, and K. Väänänen-Vainio-Mattila, Expectations for user experience in haptic communication with mobile devices, in Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services. 2009, ACM: Bonn, Germany.

Henze, N. and S. Boll, Evaluation of an off-screen visualization for magic lens and dynamic peephole interfaces, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Herbst, I., et al., TimeWarp: interactive time travel with a mobile mixed reality game, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Hinze, A.M., C. Chang, and D.M. Nichols, Contextual queries express mobile information needs, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Hutter, H.-P., T. Müggler, and U. Jung, Augmented mobile tagging, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Jones, M., et al., ONTRACK: Dynamically adapting music playback to support navigation. Personal Ubiquitous Comput., 2008. 12(7): p. 513–525.

Joshi, A., et al., Rangoli: a visual phonebook for low-literate users, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Jumisko-Pyykkö, S. and M.M. Hannuksela, Does context matter in quality evaluation of mobile television?, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Kaasinen, E., User Acceptance of Mobile Services. International Journal of Mobile Human Computer Interaction (IJMHCI), 2009. 1(1): p. 79–97.

Kaasinen, E., et al., User Experience of Mobile Internet: Analysis and Recommendations. International Journal of Mobile Human Computer Interaction (IJMHCI), 2009. 1(4): p. 4–23.

Kane, S.K., J.O. Wobbrock, and I.E. Smith, Getting off the treadmill: evaluating walking user interfaces for mobile devices in public spaces, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Kang, N.E. and W.C. Yoon, Age- and experience-related user behavior differences in the use of complicated electronic devices. Int. J. Hum.-Comput. Stud., 2008. 66(6): p. 425–437.

Kanjo, E., et al., MobGeoSen: facilitating personal geosensor data collection and visualization using mobile phones. Personal Ubiquitous Comput., 2008. 12(8): p. 599–607.

Kawsar, F., E. Rukzio, and G. Kortuem, An explorative comparison of magic lens and personal projection for interacting with smart objects, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Keijzers, J., E.d. Ouden, and Y. Lu, Usability benchmark study of commercially available smart phones: cell phone type platform, PDA type platform and PC type platform, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Kenteris, M., D. Gavalas, and D. Economou, An innovative mobile electronic tourist guide application. Personal Ubiquitous Comput., 2009. 13(2): p. 103–118.

Komninos, A. and M.D. Dunlop, A calendar based Internet content pre-caching agent for small computing devices. Personal Ubiquitous Comput., 2008. 12(7): p. 495–512.

Kratz, S., I. Brodien, and M. Rohs, Semi-automatic zooming for mobile map navigation, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Kray, C., et al., Bridging the gap between the Kodak and the Flickr generations: A novel interaction technique for collocated photo sharing. Int. J. Hum.-Comput. Stud., 2009. 67(12): p. 1060–1072.

Kristoffersen, S. and I. Bratteberg, Design ideas for IT in public spaces. Personal Ubiquitous Comput., 2010. 14(3): p. 271–286.

Lacroix, J., P. Saini, and R. Holmes, The relationship between goal difficulty and performance in the context of a physical activity intervention program, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Lavie, T. and J. Meyer, Benefits and costs of adaptive user interfaces. Int. J. Hum.-Comput. Stud., 2010. 68(8): p. 508–524.

Lee, J., J. Forlizzi, and S.E. Hudson, Iterative design of MOVE: A situationally appropriate vehicle navigation system. Int. J. Hum.-Comput. Stud., 2008. 66(3): p. 198–215.

Liao, C., et al., Papiercraft: A gesture-based command system for interactive paper. ACM Trans. Comput.-Hum. Interact., 2008. 14(4): p. 1–27.

Lin, P.-C. and L.-W. Chien, The effects of gender differences on operational performance and satisfaction with car navigation systems. Int. J. Hum.-Comput. Stud., 2010. 68(10): p. 777–787.

Lindley, S.E., et al., Fixed in time and “time in motion”: mobility of vision through a SenseCam lens, in Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services. 2009, ACM: Bonn, Germany.

Liu, K. and R.A. Reimer, Social playlist: enabling touch points and enriching ongoing relationships through collaborative mobile music listening, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Liu, N., Y. Liu, and X. Wang, Data logging plus e-diary: towards an online evaluation approach of mobile service field trial, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Liu, Y. and K.-J. Räihä, RotaTxt: Chinese pinyin input with a rotator, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Lucero, A., J. Keränen, and K. Hannu, Collaborative use of mobile phones for brainstorming, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Luff, P., et al., Swiping paper: the second hand, mundane artifacts, gesture and collaboration. Personal Ubiquitous Comput., 2010. 14(3): p. 287–299.

Mallat, N., et al., An empirical investigation of mobile ticketing service adoption in public transportation. Personal Ubiquitous Comput., 2008. 12(1): p. 57–65.

McAdam, C., C. Pinkerton, and S.A. Brewster, Novel interfaces for digital cameras and camera phones, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

McDonald, D.W., et al., Proactive displays: Supporting awareness in fluid social environments. ACM Trans. Comput.-Hum. Interact., 2008. 14(4): p. 1–31.

McKnight, L. and B. Cassidy, Children’s Interaction with Mobile Touch-Screen Devices: Experiences and Guidelines for Design. International Journal of Mobile Human Computer Interaction (IJMHCI), 2010. 2(2): p. 1–18.

Melto, A., et al., Evaluation of predictive text and speech inputs in a multimodal mobile route guidance application, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Miyaki, T. and J. Rekimoto, GraspZoom: zooming and scrolling control model for single-handed mobile interaction, in Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services. 2009, ACM: Bonn, Germany.

Moustakas, K., et al., 3D content-based search using sketches. Personal Ubiquitous Comput., 2009. 13(1): p. 59–67.

Oakley, I. and J. Park, Motion marking menus: An eyes-free approach to motion input for handheld devices. Int. J. Hum.-Comput. Stud., 2009. 67(6): p. 515–532.

Oulasvirta, A., Designing mobile awareness cues, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Oulasvirta, A., S. Estlander, and A. Nurminen, Embodied interaction with a 3D versus 2D mobile map. Personal Ubiquitous Comput., 2009. 13(4): p. 303–320.

Ozok, A.A., et al., A Comparative Study Between Tablet and Laptop PCs: User Satisfaction and Preferences. International Journal of Human-Computer Interaction, 2008. 24(3): p. 329–352.

Park, Y.S., et al., Touch key design for target selection on a mobile phone, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Peevers, G., G. Douglas, and M.A. Jack, A usability comparison of three alternative message formats for an SMS banking service. Int. J. Hum.-Comput. Stud., 2008. 66(2): p. 113–123.

Preuveneers, D. and Y. Berbers, Mobile phones assisting with health self-care: a diabetes case study, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Puikkonen, A., et al., Practices in creating videos with mobile phones, in Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services. 2009, ACM: Bonn, Germany.

Reischach, F.v., et al., An evaluation of product review modalities for mobile phones, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Reitmaier, T., N.J. Bidwell, and G. Marsden, Field testing mobile digital storytelling software in rural Kenya, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Robinson, S., P. Eslambolchilar, and M. Jones, Exploring casual point-and-tilt interactions for mobile geo-blogging. Personal and Ubiquitous Computing, 2010. 14(4): p. 363–379.

Rogers, Y., et al., Enhancing learning: a study of how mobile devices can facilitate sensemaking. Personal Ubiquitous Comput., 2010. 14(2): p. 111–124.

Rohs, M., et al., Impact of item density on the utility of visual context in magic lens interactions. Personal Ubiquitous Comput., 2009. 13(8): p. 633–646.

Sá, M.d. and L. Carriço, Lessons from early stages design of mobile applications, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Sadeh, N., et al., Understanding and capturing people’s privacy policies in a mobile social networking application. Personal Ubiquitous Comput., 2009. 13(6): p. 401–412.

Salvucci, D.D., Rapid prototyping and evaluation of in-vehicle interfaces. ACM Trans. Comput.-Hum. Interact., 2009. 16(2): p. 1–33.

Salzmann, C., D. Gillet, and P. Mullhaupt, End-to-end adaptation scheme for ubiquitous remote experimentation. Personal Ubiquitous Comput., 2009. 13(3): p. 181–196.

Schildbach, B. and E. Rukzio, Investigating selection and reading performance on a mobile phone while walking, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Schmid, F., et al., Situated local and global orientation in mobile you-are-here maps, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Schröder, S. and M. Ziefle, Making a completely icon-based menu in mobile devices to become true: a user-centered design approach for its development, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Scott, J., et al., RearType: text entry using keys on the back of a device, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Seongil, L., Mobile Internet Services from Consumers’ Perspectives. International Journal of Human-Computer Interaction, 2009. 25(5): p. 390–413.

Sharlin, E., et al., A tangible user interface for assessing cognitive mapping ability. Int. J. Hum.-Comput. Stud., 2009. 67(3): p. 269–278.

Sintoris, C., et al., MuseumScrabble: Design of a Mobile Game for Children’s Interaction with a Digitally Augmented Cultural Space. International Journal of Mobile Human Computer Interaction (IJMHCI), 2010. 2(2): p. 53–71.

Smets, N.J.J.M., et al., Effects of mobile map orientation and tactile feedback on navigation speed and situation awareness, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Sodnik, J., et al., A user study of auditory versus visual interfaces for use while driving. Int. J. Hum.-Comput. Stud., 2008. 66(5): p. 318–332.

Sørensen, C. and A. Al-Taitoon, Organisational usability of mobile computing-Volatility and control in mobile foreign exchange trading. Int. J. Hum.-Comput. Stud., 2008. 66(12): p. 916–929.

Stapel, J.C., Y.A.W.d. Kort, and W.A. IJsselsteijn, Sharing places: testing psychological effects of location cueing frequency and explicit vs. inferred closeness, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Streefkerk, J.W., M.P.v. Esch-Bussemakers, and M.A. Neerincx, Field evaluation of a mobile location-based notification system for police officers, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Takayama, L. and C. Nass, Driver safety and information from afar: An experimental driving simulator study of wireless vs. in-car information services. Int. J. Hum.-Comput. Stud., 2008. 66(3): p. 173–184.

Takeuchi, Y. and M. Sugimoto, A user-adaptive city guide system with an unobtrusive navigation interface. Personal Ubiquitous Comput., 2009. 13(2): p. 119–132.

Tan, F.B. and J.P.C. Chou, The Relationship Between Mobile Service Quality, Perceived Technology Compatibility, and Users’ Perceived Playfulness in the Context of Mobile Information and Entertainment Services. International Journal of Human-Computer Interaction, 2008. 24(7): p. 649–671.

Taylor, C.A., N. Samuels, and J.A. Ramey, Always On: A Framework for Understanding Personal Mobile Web Motivations, Behaviors, and Contexts of Use. International Journal of Mobile Human Computer Interaction (IJMHCI), 2009. 1(4): p. 24–41.

Turunen, M., et al., User expectations and user experience with different modalities in a mobile phone controlled home entertainment system, in Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services. 2009, ACM: Bonn, Germany.

Vartiainen, E., Improving the User Experience of a Mobile Photo Gallery by Supporting Social Interaction. International Journal of Mobile Human Computer Interaction (IJMHCI), 2009. 1(4): p. 42–57.

Vuolle, M., et al., Developing a questionnaire for measuring mobile business service experience, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Weinberg, G., et al., Contextual push-to-talk: shortening voice dialogs to improve driving performance, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Wilson, G., C. Stewart, and S.A. Brewster, Pressure-based menu selection for mobile devices, in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. 2010, ACM: Lisbon, Portugal.

Wobbrock, J.O., B.A. Myers, and H.H. Aung, The performance of hand postures in front- and back-of-device interaction for mobile computing. Int. J. Hum.-Comput. Stud., 2008. 66(12): p. 857–875.

Xiangshi, R. and Z. Xiaolei, The Optimal Size of Handwriting Character Input Boxes on PDAs. International Journal of Human-Computer Interaction, 2009. 25(8): p. 762–784.

Xu, S., et al., Development of a Dual-Modal Presentation of Texts for Small Screens. International Journal of Human-Computer Interaction, 2008. 24(8): p. 776–793.

Yong, G.J. and J.B. Suk, Development of the Conceptual Prototype for Haptic Interface on the Telematics System. International Journal of Human-Computer Interaction, 2010. 26(1): p. 22–52.

Yoo, J.-W., et al., Cocktail: Exploiting Bartenders’ Gestures for Mobile Interaction. International Journal of Mobile Human Computer Interaction (IJMHCI), 2010. 2(3): p. 44–57.

Yoon, Y., et al., Context-aware photo selection for promoting photo consumption on a mobile phone, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

You, Y., et al., Deploying and evaluating a mixed reality mobile treasure hunt: Snap2Play, in Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. 2008, ACM: Amsterdam, The Netherlands.

Yu, K., F. Tian, and K. Wang, Coupa: operation with pen linking on mobile devices, in Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services. 2009, ACM: Bonn, Germany.

Authors’ note

This research is supported by Oxford Brookes University through the central research fund and in part by Lero - the Irish Software Engineering Research Centre ( http://www.lero.ie ) grant 10/CE/I1855.

Adams R: Decision and stress: cognition and e-accessibility in the information workplace. Universal Access in the Information Society 2007, 5 (4):363–379. 10.1007/s10209-006-0061-9


Adams R: Applying advanced concepts of cognitive overload and augmentation in practice; the future of overload. In Foundations of augmented cognition. 2nd edition. Edited by: Schmorrow D, Stanney KM, Reeves LM. Springer Berlin Heidelberg; 2006:223–229.


Kjeldskov J, Graham C: A review of mobile HCI research methods. In Mobile HCI 2003: 5th International Symposium, Udine, Italy, September 8–11, 2003, Proceedings; 2003.


Nielsen J: Usability engineering. Morgan Kaufmann; 1994.

ISO 9241: Ergonomic requirements for office work with visual display terminals (VDTs). International Organization for Standardization, Geneva; 1997.

Zhang D, Adipat B: Challenges, methodologies, and issues in the usability testing of mobile applications. International Journal of Human-Computer Interaction 2005, 18 (3):293–308. 10.1207/s15327590ijhc1803_3

Guerreiro TJV, Nicolau H, Jorge J, Gonçalves D: Assessing mobile touch interfaces for tetraplegics. In Proceedings of the 12th international conference on Human computer interaction with mobile devices and services. Lisbon, Portugal: ACM; 2010.


Schildbach B, Rukzio E: Investigating selection and reading performance on a mobile phone while walking. In Proceedings of the 12th international conference on human computer interaction with mobile devices and services. Lisbon, Portugal: ACM; 2010.

Flood D, Harrison R, Duce D, Iacob C: Evaluating Mobile Applications: A Spreadsheet Case Study. International Journal of Mobile Human Computer Interaction (IJMHCI) 2013, 4 (4):37–65. 10.4018/jmhci.2012100103

Salvucci DD: Predicting the effects of in-car interface use on driver performance: an integrated model approach. International Journal of Human-Computer Studies 2001, 55 (1):85–107. 10.1006/ijhc.2001.0472

Hart SG, Staveland LE: Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Human mental workload 1988, 1 (3):139–183.

Flood D, Germanakos P, Harrison R, Mc Caffery F: Estimating cognitive overload in mobile applications for decision support within the medical domain . Wroclaw, Poland: 14th International conference on Enterprise Information Systems (ICEIS 2012); 2012.

Budgen D, Burn AJ, Brereton OP, Kitchenham BA, Pretorius R: Empirical evidence about the UML: a systematic literature review. Software: Practice and Experience; 2010.

Bruns E, Bimber O: Adaptive training of video sets for image recognition on mobile phones. Personal Ubiquitous Comput 2009, 13 (2):165–178. 10.1007/s00779-008-0194-3

Schinke T, Henze N, Boll S: Visualization of off-screen objects in mobile augmented reality. In Proceedings of the 12th international conference on human computer interaction with mobile devices and services, September 07–10, 2010. Lisbon, Portugal: ACM; 2010.

Smets NJJM, Brake GM, Neerincx MA, Lindenberg J: Effects of mobile map orientation and tactile feedback on navigation speed and situation awareness. In Proceedings of the 10th international conference on human computer interaction with mobile devices and services. Amsterdam, The Netherlands: ACM; 2008.

Ghiani G, Leporini B, Paternò F: Vibrotactile feedback as an orientation aid for blind users of mobile guides. In Proceedings of the 10th international conference on human computer interaction with mobile devices and services. Amsterdam, The Netherlands: ACM; 2008.

Jones M, Jones S, Bradley G, Warren N, Bainbridge D, Holmes G: ONTRACK: Dynamically adapting music playback to support navigation. Personal Ubiquitous Computing 2008, 12 (7):513–525. 10.1007/s00779-007-0155-2

Burigat S, Chittaro L, Parlato E: Map, diagram, and web page navigation on mobile devices: the effectiveness of zoomable user interfaces with overviews. In Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. Amsterdam, The Netherlands: ACM; 2008:147–156.

Sodnik J, Dicke C, Tomaic S, Billinghurst M: A user study of auditory versus visual interfaces for use while driving. Int. J. Hum.-Comput. Stud 2008, 66 (5):318–332. 10.1016/j.ijhcs.2007.11.001

Weinberg G, Harsham B, Forlines C, Medenica Z: Contextual push-to-talk: shortening voice dialogs to improve driving performance. In Proceedings of the 12th international conference on human computer interaction with mobile devices and services. Lisbon, Portugal: ACM; 2010.

Park YS, Han SH, Park J, Cho Y: Touch key design for target selection on a mobile phone. In Proceedings of the 10th international conference on human computer interaction with mobile devices and services. Amsterdam, The Netherlands: ACM; 2008.

Brewster SA, Hughes M: Pressure-based text entry for mobile devices. In Proceedings of the 11th international conference on human-computer interaction with mobile devices and services. Bonn, Germany: ACM; 2009.

Oakley I, Park J: Motion marking menus: an eyes-free approach to motion input for handheld devices. Int J Hum.-Comput. Stud 2009, 67 (6):515–532. 10.1016/j.ijhcs.2009.02.002

Hall M, Hoggan E, Brewster S: T-Bars: towards tactile user interfaces for mobile touchscreens. In Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. Amsterdam, The Netherlands: ACM; 2008.

McAdam C, Pinkerton C, Brewster SA: Novel interfaces for digital cameras and camera phones. In Proceedings of the 12th international conference on human computer interaction with mobile devices and services. Lisbon, Portugal: ACM; 2010.

Heikkinen J, Olsson T, Väänänen-Vainio-Mattila K: Expectations for user experience in haptic communication with mobile devices. In Proceedings of the 11th international conference on human-computer interaction with mobile devices and services. Bonn, Germany: ACM; 2009.

Kristoffersen S, Bratteberg I: Design ideas for IT in public spaces. Personal Ubiquitous Comput 2010, 14 (3):271–286. 10.1007/s00779-009-0255-2

Mallat N, Rossi M, Tuunainen VK, Oörni A: An empirical investigation of mobile ticketing service adoption in public transportation. Personal Ubiquitous Comput 2008, 12 (1):57–65.

Axtell C, Hislop D, Whittaker S: Mobile technologies in mobile spaces: findings from the context of train travel. International Journal of Human Computer Studies 2008, 66 (12):902–915. 10.1016/j.ijhcs.2008.07.001

Fehnert B, Kosagowsky A: Measuring user experience: complementing qualitative and quantitative assessment. In Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. Amsterdam, The Netherlands: ACM; 2008.

Lacroix J, Saini P, Holmes R: The relationship between goal difficulty and performance in the context of a physical activity intervention program. In Proceedings of the 10th international conference on Human computer interaction with mobile devices and services. Amsterdam, The Netherlands: ACM; 2008.

Maguire M: Context of use within usability activities. International Journal of Human-Computer Studies 2001, 55 (4):453–483. 10.1006/ijhc.2001.0486


Author information

Authors and affiliations

Oxford Brookes University, Oxford, UK

Rachel Harrison, Derek Flood & David Duce


Corresponding author

Correspondence to Rachel Harrison .

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

DF performed the literature review, helped to propose the PACMAD model and drafted the manuscript. RH assisted the literature review, proposed the PACMAD model and drafted the limitations section. DAD helped to refine the conceptual framework and direct the research. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cite this article

Harrison, R., Flood, D. & Duce, D. Usability of mobile applications: literature review and rationale for a new usability model. J Interact Sci 1 , 1 (2013). https://doi.org/10.1186/2194-0827-1-1


Received : 10 March 2013

Accepted : 10 March 2013

Published : 07 May 2013

DOI : https://doi.org/10.1186/2194-0827-1-1


Keywords: Mobile Phone, Mobile Device, Cognitive Load, Augmented Reality, Usability Model




A literature review about usability evaluation methods for e-learning platforms

Affiliation

  • 1 Department of Production and Systems Engineering, University of Minho, Guimarães, Portugal. [email protected]
  • PMID: 22316857
  • DOI: 10.3233/WOR-2012-0281-1038

The usability analysis of information systems has been the target of several research studies over the past thirty years. These studies have highlighted a great diversity of points of view, including researchers from different scientific areas such as Ergonomics, Computer Science, Design and Education. Within the domain of information ergonomics, the study of tools and methods used for usability evaluation dedicated to E-learning shows a continuous and dynamic evolution of E-learning systems, in many different contexts - academic and corporate. These systems, also known as LMS (Learning Management Systems), can be classified according to their educational goals and their technological features. However, in these systems the usability issues are related to the relationship and interactions between user and system in the user's context. This review is a synthesis of a research project on Information Ergonomics and embraces three dimensions, namely the methods, models and frameworks that have been applied to evaluate LMS. The study also includes the main usability criteria and heuristics used. The results show a notable change in usability paradigms and make it possible to discuss the studies carried out by different researchers focusing on usability ergonomic principles for E-learning.


Similar articles

  • Usability testing with children: an application of Pedactice and Ticese methods. Carusi A, Mont'Alvão C. Work. 2012;41 Suppl 1:822-6. doi: 10.3233/WOR-2012-0248-822. PMID: 22316823
  • Usability issues in Learning Management Systems (LMS). Muniz MI, Moraes Ad. Work. 2012;41 Suppl 1:832-7. doi: 10.3233/WOR-2012-0250-832. PMID: 22316825
  • Cognitive-ergonomics and instructional aspects of e-learning courses. Rodrigues M, Castello Branco I, Shimioshi J, Rodrigues E, Monteiro S, Quirino M. Work. 2012;41 Suppl 1:5684-5. doi: 10.3233/WOR-2012-0919-5684. PMID: 22317652
  • Developing an usability test to evaluate the use of augmented reality to improve the first interaction with a product. Albertazzi D, Okimoto ML, Ferreira MG. Work. 2012;41 Suppl 1:1160-3. doi: 10.3233/WOR-2012-0297-1160. PMID: 22316876. Review.
  • Educational software usability: Artifact or Design? Van Nuland SE, Eagleson R, Rogers KA. Anat Sci Educ. 2017 Mar;10(2):190-199. doi: 10.1002/ase.1636. Epub 2016 Jul 29. PMID: 27472554. Review.
  • Design and usability testing of an in-house developed performance feedback tool for medical students. Roa Romero Y, Tame H, Holzhausen Y, Petzold M, Wyszynski JV, Peters H, Alhassan-Altoaama M, Domanska M, Dittmar M. BMC Med Educ. 2021 Jun 23;21(1):354. doi: 10.1186/s12909-021-02788-4. PMID: 34162382. Free PMC article.
  • Learning without Borders: Asynchronous and Distance Learning in the Age of COVID-19 and Beyond. Brady AK, Pradhan D. ATS Sch. 2020 Jul 30;1(3):233-242. doi: 10.34197/ats-scholar.2020-0046PS. PMID: 33870291. Free PMC article.
  • Usability of Learning Moment: Features of an E-learning Tool That Maximize Adoption by Students. Chu A, Biancarelli D, Drainoni ML, Liu JH, Schneider JI, Sullivan R, Sheng AY. West J Emerg Med. 2019 Dec 9;21(1):78-84. doi: 10.5811/westjem.2019.6.42657. PMID: 31913823. Free PMC article.
  • Evaluation of Nursing Information Systems: Application of Usability Aspects in the Development of Systems. Moghaddasi H, Rabiei R, Asadi F, Ostvan N. Healthc Inform Res. 2017 Apr;23(2):101-108. doi: 10.4258/hir.2017.23.2.101. Epub 2017 Apr 30. PMID: 28523208. Free PMC article.
  • Towards Usable E-Health. A Systematic Review of Usability Questionnaires. Sousa VEC, Dunn Lopez K. Appl Clin Inform. 2017 May 10;8(2):470-490. doi: 10.4338/ACI-2016-10-R-0170. PMID: 28487932. Free PMC article. Review.



Original research article

Person-based design and evaluation of MIA, a digital medical interview assistant for radiology

Kerstin Denecke

  • 1 Artificial Intelligence for Health, Institute for Patient-Centered Digital Health, School of Engineering and Computer Science, Bern University of Applied Sciences, Biel, Switzerland
  • 2 Department of Radiology, Lindenhof Hospital, Bern, Switzerland
  • 3 University Institute for Diagnostic, Interventional and Pediatric Radiology, Inselspital, University Hospital Bern, University of Bern, Bern, Switzerland
  • 4 Department of Radiation Oncology, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
  • 5 Mimacom AG, Bern, Switzerland

Introduction: Radiologists frequently lack direct patient contact due to time constraints. Digital medical interview assistants aim to facilitate the collection of health information. In this paper, we propose leveraging conversational agents to realize a medical interview assistant to facilitate medical history taking, while at the same time offering patients the opportunity to ask questions on the examination.

Methods: MIA, the digital medical interview assistant, was developed using a person-based design approach, involving patient opinions and expert knowledge during design and development, with the specific use case of collecting information before a mammography examination. MIA consists of two modules: the interview module and the question answering (Q&A) module. To ensure interoperability with clinical information systems, we use HL7 FHIR to store and exchange the results collected by MIA during the patient interaction. The system was evaluated according to an existing evaluation framework that covers a broad range of aspects related to the technical quality of a conversational agent, including usability as well as accessibility and security.

Results: Thirty-six participants recruited from two Swiss hospitals (Lindenhof Group and Inselspital, Bern) and two patient organizations completed the usability test. MIA was favorably received by the participants, who particularly noted the clarity of communication. However, there is room for improvement in the perceived quality of the conversation, the information provided, and the protection of privacy. The Q&A module achieved a precision of 0.51, a recall of 0.87 and an F-score of 0.64 based on 114 questions asked by the participants. Security and accessibility also require improvements.

Conclusion: The applied person-based process described in this paper can provide best practices for future development of medical interview assistants. The application of a standardized evaluation framework helped in saving time and ensures comparability of results.

1 Introduction

Medical history forms the basis of clinical diagnosis and decision-making. A medical history interview should be conducted immediately before the investigation or on the same day. The medical history must be acquired frequently and, for some aspects, every time a person is exposed to examinations or interventions ( Taslakian et al., 2016 ). Documentation from referring healthcare institutions is frequently not reliable and does not contain all necessary data items ( Bell et al., 2020 ).

Computer-assisted history-taking systems or digital medical interview assistants (DMIA) are tools that help in obtaining relevant data on the medical history of patients ( Pringle, 1988 ). Although such systems have been available for four decades, they have remained largely unused in clinical routine ( Slack et al., 1966 ). DMIA have been shown to be efficient in saving professionals' time, in improving delivery of care to those with special needs, and in facilitating information collection, especially of potentially sensitive information (e.g., sexual history and alcohol consumption). Benefits of DMIA include the potential time saving, since the patient history can be collected outside the patient-doctor encounter; the administrative burden of entering this information is reduced, patient face-to-face time is increased, and collected data can be automatically added to medical records, where they are available for automatic processing for decision support ( Spinazze et al., 2021 ). Another positive aspect is that patients become more engaged in the diagnostic process, resulting in improved participation in personal care, compliance with medication, adherence to recommended treatment, and monitoring of prescriptions and doses ( Arora et al., 2014 ). Patient engagement becomes even more relevant with care concepts of value-based and patient-centered care. Good communication with patients has the potential to improve the coordination of care, improve safety and outcomes, increase patient satisfaction ( Nairz et al., 2018 ) and decrease the cost of care ( Doyle et al., 2013 ).

Factors related to accessibility, affordability, accuracy, and acceptability have been identified as limitations of DMIA that hampered their adoption in daily routine ( Spinazze et al., 2021 ). Acceptability challenges can originate in usability issues ( Wei et al., 2011 ), i.e., users reported that they had difficulties in interacting with DMIA. Another limitation of existing tools is that irrelevant questions are posed by the system. In addition, some systems are difficult to use, resulting in users frustrated by technical problems ( Pappas et al., 2011 ). Research suggests that following a person-based approach in the design and development of a new system has the potential to improve the system's quality and result in a higher level of user acceptance ( Dabbs et al., 2009 ). Barriers toward the use of DMIA from a healthcare provider's perspective include (1) missing workflows and protocols related to patient-generated health data and (2) data storage, accessibility, and ease of use ( Cohen et al., 2016 ).

In this paper, we focus on supporting the medical interview in the context of radiology with a DMIA that has been developed using a person-based approach and that considers interoperability standards in healthcare, in order to avoid the above-mentioned limitations of existing solutions. Radiology is a high-throughput medical discipline, highly dependent on and driven by complex imaging technology. These two factors, high patient throughput and technological advancement, have led to streamlined processes requiring very specialized labor skills. Hence, the radiology process is essentially bipartite, being split into an imaging and a reporting part. The interaction with patients to obtain images of internal body structures is generally performed by medical technicians, while the medical interpretation of images is under the responsibility of physicians. Thereby, the patient-physician relationship is disrupted ( Rockall et al., 2022 ). Patients often do not know the role of radiologists, but they perceive value in consulting directly with imaging experts ( Koney et al., 2016 ). Collecting the medical history in the context of mammography is crucial for several reasons. There are some physiological states or properties of a person that can significantly influence breast tissue and, therefore, impact the evaluation of an image by radiologists. Data from the medical history such as menopausal status, hormonal therapy or contraception, previous treatment, injuries, or symptoms may significantly impact imaging and change a radiologist's perspective ( Jones et al., 2020 ; Han et al., 2022 ). For example, a vaccination can lead to swollen lymph nodes, which can have an impact on the interpretation of the mammography image; this makes a recent vaccination a relevant item in the patient's medical history. Information from the medical history can also lead to protocol changes for the radiological examination ( Nairz et al., 2018 ). Neither the methodology of information transfer nor the content of the medical history is currently considered optimal for supporting a radiologist in image interpretation ( Nairz et al., 2018 ; Rockall et al., 2022 ). Based on the use case of mammography/breast imaging, this paper describes a DMIA called “MIA,” implemented as a conversational agent that supports a radiologist in gathering accurate and current health information from a patient while giving the patient the opportunity to get answers to questions related to the examination. Conversational agents are software programs or components of larger systems designed to interact with users through natural language ( Laranjo et al., 2018 ; Milne-Ives et al., 2020 ; Tudor Car et al., 2020 ). These agents feature complex technical properties, resulting in various types that span from rule-based systems with simple personalities to more sophisticated embodied agents with complex personalities ( Denecke and May, 2023 ). Conversational agents can deliver information, answer questions, or assist with a range of tasks ( Laranjo et al., 2018 ).

This paper describes the development process of MIA, its system architecture and the results from a comprehensive evaluation of the system including usability assessment.

2 Methods

We have already reported on the design process of MIA in a previous publication (see Denecke et al., 2023 ). Originally, MIA was only supposed to collect the medical history of a person before undergoing a radiological examination. We augmented this first system design with a dedicated module that enables MIA to provide answers to frequently asked questions regarding the examination. In this section, we briefly summarize the design and development of a testable prototype of MIA, and then focus on the evaluation methodology used to assess this prototype.

2.1 Design and development of MIA

2.1.1 Requirements gathering

To ensure that the needs and perspectives of radiologists and patients are taken into account in MIA, the requirements engineering process was guided by a person-based approach as described by Yardley et al. (2015) . This approach aims to embed iterative, in-depth qualitative research throughout the development process to ensure that the intervention is aligned with the psycho-social context of the end users. We also took into account the recommendations of the DISCOVER conceptual framework ( Dhinagaran et al., 2022 ). DISCOVER provides a detailed protocol for the design, development, evaluation, and implementation of rule-based conversational agents. As a result, we established fundamental intervention goals to guide the development of MIA. These goals informed the specification of requirements, which were derived from a narrative literature review and a patient survey, and supplemented by specifications from a radiologist. The patient survey was distributed among the members of the patient lobby group of a collaborating hospital (Inselspital Bern), comprising 25 members, of which 8 responded. The collected information was aggregated into functional and non-functional requirements for MIA. The requirement collection process was already described by Denecke et al. (2023) ; the list of requirements is made available (see data availability statement).

2.1.2 Content generation

An initial set of 72 medical interview questions in German was defined by a single radiologist. In collaboration with two additional radiologists, this set of questions was reduced to 31 questions and augmented with allowable answers, forming a set of Common Data Elements (CDE). A CDE defines the attributes and allowable values of a unit of information and facilitates the exchange of structured information ( Rubin and Kahn Jr, 2017 ). These CDEs were iteratively improved before integration into MIA with respect to clarity, usefulness, relevance and correctness, as well as feasibility of technical implementation.

For the question answering (Q&A) module of MIA, we collected frequently asked questions related to mammography from information material provided by the Swiss national breast cancer screening program Donna ( https://www.donna-programm.ch ). Furthermore, we interacted with OpenAI ChatGPT to get additional inspiration for possible user questions, using the following prompt: “ Take the role of a woman undergoing a mammography for the first time. Which questions do you have regarding the examination.” The resulting collection of question-answer pairs in German was reviewed, extended and corrected by two radiologists to ensure correctness and completeness. The questionnaires of the interview module and the Q&A module are made available (see data availability statement).

2.1.3 System architecture

The prototype of MIA includes two main modules (see Figure 1 ): First, there is the medical interview module, which is designed to work seamlessly with current hospital or radiology information systems. This central component features a web-based user interface that is specially optimized for use on tablets. Second, the Q&A module contains the logic that maps patient questions to pre-defined question-answer pairs.


Figure 1 . UML component diagram of a MIA instance (external RIS/HIS omitted).

The architecture of MIA was developed with two major prerequisites in mind: the architecture should make it easy to exchange the content of the conversational agent, i.e., the questions asked as part of the medical history interview. In addition, the answers should be stored in a way that allows them to be imported into a hospital information system, thus ensuring interoperability. Both prerequisites are met by basing MIA on the Fast Healthcare Interoperability Resources (FHIR) standard for healthcare data exchange, published by Health Level Seven International (HL7). To implement FHIR, standardized data exchange formats, so-called FHIR profiles, were specified for defining the medical interview questionnaires and returning the resulting patient responses. A FHIR profile exactly specifies the type, cardinality, and structure of information to be persisted or exchanged between two systems. We based these profiles on the FHIR Structured Data Capture Implementation Guide, version 3.0.0 (SDC IG). The SDC IG is a FHIR-based framework that provides guidance related to filling in medical forms, comprising resource definitions and workflow considerations ( HL7 International, 2023 ). Table 1 provides a description of the four profiles as well as their associated base profiles. The FHIR profiles are made available (see data availability statement).


Table 1 . Description of the developed FHIR profiles.
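For illustration, the sketch below shows what a minimal questionnaire definition and the corresponding response could look like when expressed as FHIR resources in plain Python dictionaries. The resource ids, linkIds, item wording, and the profile URL are hypothetical placeholders and are not taken from the actual MiaQuestionnaire and MiaQuestionnaireResponse profiles, which are published with the data availability statement.

```python
import json

# Minimal, illustrative FHIR Questionnaire with one single-choice interview item.
# All identifiers and the profile URL are hypothetical placeholders.
questionnaire = {
    "resourceType": "Questionnaire",
    "id": "mia-mammography-interview",
    "meta": {"profile": ["https://example.org/fhir/StructureDefinition/MiaQuestionnaire"]},
    "status": "active",
    "item": [
        {
            "linkId": "breast-pain",
            "text": "Have you recently noticed pain in the breast region?",
            "type": "choice",
            "answerOption": [
                {"valueCoding": {"code": "yes", "display": "Yes"}},
                {"valueCoding": {"code": "no", "display": "No"}},
            ],
        }
    ],
}

# The QuestionnaireResponse that the interview module would populate after the
# patient has answered and confirmed the summary page.
questionnaire_response = {
    "resourceType": "QuestionnaireResponse",
    "questionnaire": "Questionnaire/mia-mammography-interview",
    "status": "completed",
    "item": [
        {
            "linkId": "breast-pain",
            "text": "Have you recently noticed pain in the breast region?",
            "answer": [{"valueCoding": {"code": "no", "display": "No"}}],
        }
    ],
}

# Serialize for transmission to the initiating hospital or radiology information system.
print(json.dumps(questionnaire_response, indent=2))
```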

As the Q&A module was added only late in the design phase, its data exchange formats were not specified. The Q&A module was not integrated into the MIA system, but is deployed independently and accessed via API call from the MIA system. Please refer to our previous publication, in which we describe the development process and evaluation of the Q&A module in detail ( Reichenpfader et al., 2024 ).

2.1.4 Information flow

The MIA system operates as a conversational agent, which means that the user interacts with the system through a dialogue. The dialogue flow of MIA consists of three distinct parts: On-boarding with authentication, the medical interview conducted by MIA and the question and answer part, where users can ask questions related to the examination. We describe the user interaction of the final system including the process triggering from the hospital information system in more detail below.

The MIA user interface is optimized for being accessed on a tablet, ideally hospital-owned. The process of conducting a medical interview for a certain patient, also called a task, is triggered within the hospital or radiology information system (see Figure 2 ). A MiaTask resource is sent to the MIA application, which then downloads the specified MiaQuestionnaire resource, containing the content and structure of the interview. Hospital staff see all open tasks on the tablet and trigger the start of a specific task before handing the patient the device. The device then displays the first part of the conversation flow, the on-boarding. For our usability testing, MIA did not communicate with external systems. One task and one questionnaire resource were hard-coded in the system and started by the test facilitator.


Figure 2 . UML sequence diagram of an MIA interaction.

During the on-boarding process, users are welcomed by MIA, informed about how their data are handled, and asked to identify themselves by providing their full name as well as their birth date (in the usability test, pre-defined data was used). If they do not provide the right information as defined in the associated MiaPatient resource, the interaction process is suspended, and the patient is asked to reach out to hospital staff for help.

After successful identification, the MIA prototype renders a maximum of 31 questions about the patient's medical history, as defined in the MiaQuestionnaire resource. The questions concern previous visits to physicians and therapies related to the breast (chemotherapy, radiation therapy, etc.), and related to observations of recent changes in the breast region including pain or injuries. Eighteen questions allow for single-choice answers. Seven questions allow for multiple-choice answers and six questions are answered by entering free text. The system does not interpret free-text responses. Only responses to single- or multiple-choice questions change the conversation flow. For example, questions about pregnancy are only asked if the patient states not to be male. We make the MiaQuestionnaire resource used for usability testing available (see data availability statement).
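The conditional rendering described above, such as skipping the pregnancy question for male patients, can be sketched along the lines of the enableWhen mechanism of FHIR questionnaires. The item identifiers, wording, and simplified answer values in the sketch below are hypothetical and not taken from the actual MiaQuestionnaire resource.

```python
# Illustrative sketch of conditional question rendering, loosely modeled on the
# enableWhen mechanism of FHIR questionnaires. All linkIds and texts are invented.
ITEMS = [
    {"linkId": "sex", "text": "What is your sex?", "type": "choice"},
    {
        "linkId": "pregnancy",
        "text": "Is there a possibility that you are pregnant?",
        "type": "choice",
        # Only shown when the answer to "sex" is not "male".
        "enableWhen": [{"question": "sex", "operator": "!=", "answer": "male"}],
    },
    {"linkId": "breast-changes", "text": "Have you noticed recent changes in the breast region?", "type": "choice"},
]


def is_enabled(item: dict, answers: dict) -> bool:
    """Evaluate the enableWhen conditions of an item against the answers given so far."""
    for cond in item.get("enableWhen", []):
        given = answers.get(cond["question"])
        if cond["operator"] == "=" and given != cond["answer"]:
            return False
        if cond["operator"] == "!=" and given == cond["answer"]:
            return False
    return True


def next_question(answers: dict):
    """Return the next enabled, unanswered question, or None when the interview is done."""
    for item in ITEMS:
        if item["linkId"] not in answers and is_enabled(item, answers):
            return item
    return None


# A patient who states to be male never sees the pregnancy question.
answers = {"sex": "male"}
print(next_question(answers)["linkId"])  # -> "breast-changes"
```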

In the third part of the dialogue flow, the user can ask free-text questions regarding the topic of mammography. Each patient question is individually sent to the Q&A module, which computes the most similar pre-defined question and returns the answer to MIA. The 33 predefined question-answer pairs of the Q&A module, against which patient queries are matched, are made available (see data availability statement).
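A minimal sketch of this matching step is given below. The embedding method of the actual Q&A module is not described here, so TF-IDF vectors serve purely as a runnable stand-in; the knowledge-base entries and fallback wording are invented, while the 0.7 cosine-similarity threshold and the minimum query length of four words follow the behavior reported for the prototype later in this paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical excerpt of the knowledge base (the real one holds 33 pairs, in German).
KNOWLEDGE_BASE = [
    ("How long does a mammography take?",
     "The examination itself usually takes only a few minutes."),
    ("Is a mammography painful?",
     "The compression can be uncomfortable, but it lasts only a few seconds."),
]

FALLBACK = "I am sorry, I cannot answer this question. Please ask the hospital staff."
SIMILARITY_THRESHOLD = 0.7  # cosine similarity required for a match

_vectorizer = TfidfVectorizer()
_question_matrix = _vectorizer.fit_transform([q for q, _ in KNOWLEDGE_BASE])


def answer(patient_question: str) -> str:
    # Queries with three words or fewer are rejected, as in the evaluated prototype.
    if len(patient_question.split()) <= 3:
        return FALLBACK
    query_vec = _vectorizer.transform([patient_question])
    similarities = cosine_similarity(query_vec, _question_matrix)[0]
    best = similarities.argmax()
    if similarities[best] < SIMILARITY_THRESHOLD:
        return FALLBACK
    return KNOWLEDGE_BASE[best][1]


print(answer("How much time does the mammography take?"))
```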

After finalizing the third part of the dialogue flow, a confirmation page is displayed with all questions asked by the user, as well as the responses they submitted. This allows the user to verify their responses before submitting them. Currently, the answers shown in this summary cannot be edited. After the user has submitted their responses, the system populates the MiaQuestionnaireResponse resource, containing the corresponding answers to the questions defined in the MiaQuestionnaire resource. This resource is then transmitted to the initiating system. For the usability test, the resource is not sent; instead, a log file is generated and downloaded locally. See Figures 1 , 2 for a UML component and sequence diagram of the system, respectively.

2.2 Evaluation of MIA

2.2.1 Underlying evaluation framework

The evaluation of MIA was conducted based on the evaluation and development framework proposed by Denecke (2023a) . The framework consists of four perspectives that in turn aggregate several evaluation categories:

• Global perspective: Accessibility, ease of use, engagement, classifier performance, flexibility, content accuracy, context awareness, error tolerance, and security.

• Response understanding perspective: Understanding.

• Response generation perspective: Appropriateness, comprehensibility, speed, empathy, and linguistic accuracy.

• Aesthetics perspective: Background, font, and buttons.

Furthermore, the framework suggests concrete metrics and heuristics to be used to evaluate a conversational agent in healthcare. We adapted the framework by removing the aspects that are not relevant for MIA. For example, from the aesthetics perspective, we removed the evaluation category “button” since MIA has no buttons. From the response generation perspective, we dropped “linguistic accuracy” since MIA does not generate answers, but simply posts phrases from the knowledge base. The complete set of evaluation aspects is listed in the Supplementary material (see data availability statement).

To evaluate MIA according to the heuristics and metrics described in the framework, we (1) conducted a technical evaluation of MIA (e.g., security aspects) using a design and implementation check, (2) assessed the usability using a task-based usability test and (3) analyzed conversation protocols as collected during the usability test.

2.2.2 Study design and procedure for usability testing

The goal of the usability testing was to determine to what extent usability (efficiency, effectiveness, acceptance) is achieved by the current implementation as well as to identify aspects on how to improve the user interface and the conversation flow of MIA. The usability test was conducted under controlled test conditions with representative users. Prior to participant recruitment, the study plan underwent review by the regional ethics committee and was determined to be exempt from approval (BASEC-Nr: Req-2023-00982).

We aimed to recruit a total of 30 patients undergoing a mammography at two collaborating hospitals for the usability testing. According to usability expert Jakob Nielsen, “testing with 20 users typically offers a reasonably tight confidence interval” in usability testing ( Nielsen, 2006 ).

The usability test was conducted in a closed room within each respective hospital. In addition to these patients, we recruited members of two different patient organizations to join the usability test. Their usability test followed the same procedure as that of the patients, except that they did not undergo a mammography examination afterward and they answered an additional questionnaire with heuristics. The following exclusion criteria for participant recruitment were defined:

• No basic skills in interacting with a smartphone or tablet.

• Unwillingness to interact with MIA.

• German language skills below level B1.

• Patients who are unable to read or write.

• Patients younger than 18 years.

The participants did not receive any monetary compensation, but were given a small box of chocolates. The usability test comprised two tasks: first, participants were instructed to answer the questions asked by MIA; second, they were asked to pose at least three questions to MIA related to an upcoming mammography examination. The participants were asked to think aloud, to provide honest opinions regarding the usability of the application, and to participate in post-session subjective questionnaires and debriefing. Below, we describe the test procedure in detail. Each test session was conducted in a separate room to ensure privacy and was accompanied by a facilitator, who:

• provided an overview of the study and the system to participants,

• defined the term usability and explained the purpose of usability testing to participants,

• assisted in conducting participant debriefing sessions,

• responded to participants' requests for assistance, and

• collected the comments provided by the participants during the testing and during the post-testing interview.

Upon agreement to participate in the test, each participant was assigned a random identifier and provided with a test device (Apple iPad Pro 2018). First, participants filled in the first part of the online questionnaire, which was created using a local LimeSurvey instance. This first part comprised demographic information as well as questions to validate fulfillment of any exclusion criteria, namely knowledge on the topic of mammography, whether the person has already had a mammography before, gender, age, familiarity with tablet and mobile phone use and self-judgment of German language skills.

If the participant was not excluded, they continued with the actual usability test. To ensure anonymous data collection, each participant was provided with the same fictitious name and date of birth. In this way, all data was collected anonymously. After the completion of both tasks, the participant was provided with the second part of the online questionnaire, comprising standardized questionnaires to be rated on a 5-point Likert scale. We applied the Bot Usability Scale as described by Borsci et al. (2022) . We also added eleven further questions that were part of the evaluation framework ( Denecke, 2023a ). These questions refer to empathy expressed by MIA, comprehensibility and perception of the capabilities of MIA, as well as aesthetic aspects such as background color and font type.

The members of patient organizations additionally assessed MIA based on eleven heuristic criteria for conversational agents in healthcare proposed by Langevin et al. (2021) . For each heuristic, we defined a concrete catalog of criteria for assigning 1, 2, or 3 points per item. The heuristics and the criteria can be found in the Supplementary material .

2.2.3 Data analysis

To ensure an unbiased data analysis, the data collected in the user study was analyzed by two authors (DW, KK) who were neither involved in the development of the system nor involved in the treatment of the participants. We collected the data from the conversation protocols (i.e., the interaction between MIA and participant), the usability questionnaires and the notes taken by the facilitators.

3 Results

3.1 Results from the technical evaluation

In the following, we summarize the results of the technical evaluation that resulted from a design and implementation check. In Section 3.2, we report all results that have been collected within the usability testing. The complete list of results of the evaluation framework is available in the Supplementary material .

3.1.1 Accessibility

The readability of MIA's content was calculated using four different readability scores: SMOG (Simple Measure of Gobbledygook) Readability, Gunning Fog Index, Flesch Reading Ease Score and LIX ( Fabian et al., 2017 ). While Flesch Reading Ease Score, Gunning Fog Index and SMOG consider syllables and unfamiliar words for their calculation, LIX calculates the percentage of words with seven or more letters, i.e., it calculates the index by considering the number of sentences and the number of long terms. Table 2 summarizes the scores for the interview module and Q&A module.


Table 2 . Results from readability assessment.

The content of the Q&A module reaches an average LIX value of 51/100, which corresponds to a language level of C1 (Common European Framework of Reference for Languages). The interview module has a readability index of LIX 67/100, corresponding to language level C2. The other scores provide a slightly different picture. In these assessments, the content of the Q&A module is recognized as rather complex to understand. A Flesch Reading Ease Score of 46, a Gunning Fog Index of 17.46 and a SMOG of 46 for the content of the Q&A module correspond to a college or undergraduate reading level, i.e., difficult to read. For the interview module, a Flesch Reading Ease Score of 64 and a SMOG of 10.45 correspond to plain English, easily understood by 13–15 year old students. A Gunning Fog Index of 11.15 corresponds to 11th grade, i.e., fairly difficult to read.
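For reference, the LIX index used above is commonly computed as the average sentence length plus the percentage of words with seven or more letters. The snippet below is a minimal sketch of that formula; the tokenization applied by the calculation platform used in the study may differ, and the example sentence is invented.

```python
import re


def lix(text: str) -> float:
    """LIX readability index: words per sentence plus the percentage of words
    with seven or more letters. Higher values indicate harder text."""
    words = re.findall(r"[A-Za-zÀ-ÿ]+", text)
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    long_words = sum(1 for w in words if len(w) >= 7)
    return len(words) / len(sentences) + 100 * long_words / len(words)


# Toy example only; the reported scores were computed on MIA's full German content.
print(round(lix("Haben Sie in letzter Zeit Schmerzen in der Brustregion bemerkt?"), 1))
```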

MIA does not provide alternatives for written input or output. The contrast between text and background color is 3.6:1. It is possible to resize text in the graphical user interface. Accessibility guidelines have not been considered in the development phase.

3.1.2 Content accuracy

The underlying knowledge base of MIA is evidence-based and healthcare professionals as well as representatives of patient organizations were involved in the development process. However, a maintenance process for MIA's content has not yet been developed as the current prototype is considered a proof-of-concept implementation. Information on the developer and content provider is shown to the user during the on-boarding process.

3.1.3 Context awareness

Context awareness is not provided in the current implementation of MIA; context switches are not recognized due to the way MIA is realized. User input in the interview module is only used to decide whether a follow-up question is asked or not.

3.1.4 Flexibility in dialogue handling

Flexibility in dialogue handling is not yet provided by MIA, given its rule-based implementation and the lack of interpretation of user input. The interview module simply asks one question after another, as foreseen in the pre-defined conversation flow. In the Q&A module, the user query is matched against the knowledge base. When no match can be found, a standard answer is provided.

3.1.5 Security

Only a few security measures are implemented in the prototype: user authentication, authorization, and session management. A privacy statement is provided to the user in the on-boarding process. Standard operating procedures are in place for processing personally identifiable information according to the privacy statement. MIA is compliant with the current data privacy regulations, i.e., the General Data Protection Regulation in Europe. The programming packages used in MIA are scanned for vulnerabilities. For the development of MIA, no security-by-design approach was followed and no established security management standard was applied. No measures have been implemented for managing the reliability and maintenance of the third-party software and components used. Additionally, collected data is not encrypted and no process has been established yet to test the security of MIA on a regular basis.

3.1.6 Technical issues

Technical issues were collected from the notes made by the facilitators during the tests. They were grouped into high priority, low priority, additional requirements and additional aspects mentioned. Five issues were classified as high priority. For example, it was suggested that the entry of the date of birth be facilitated. Three items were of low priority, e.g., improving the drop-down menu. Specifically, a drop-down list is redundant when only one out of two items can be selected. Two additional requirements were collected: it was suggested to add a button for requesting to talk to a human and to display the patient name in the chat view. Four additional aspects were mentioned, e.g., that it was unclear in which order name and surname have to be entered for authentication.

3.2 Results from the usability test

3.2.1 Participants

We included 36 participants in the usability test. Thirty of the 36 participants were actual patients interacting with MIA before undergoing their mammography examination; 6/36 were members of patient organizations, who filled in an extended form of the usability questionnaire including heuristic criteria. The tests were conducted on 5 days between December 2023 and February 2024.

Four participants were between 40 and 49 years old (11.1%); 12 (33.3%) were between 50 and 59 years old; nine participants (25%) were between 60 and 69 years old and 11 participants (30.5%) were 70 years or older. The majority of participants were women (32 participants, 88.9%). Thirty-five participants (97.2%) were native German speakers; one participant selected language level B2. Two participants (5.5%) had no previous experience with smartphones and tablets. Out of the individuals surveyed, 30 (83.3%) had undergone a mammography at some point earlier in their lives. Four individuals had little to no knowledge about mammography, three had only a minimal understanding of the topic, nineteen possessed basic knowledge about mammography, and 10 were very familiar with the subject.

3.2.2 Results from the heuristic evaluation

Results from the heuristic evaluation are shown in Figure 3 , n = 6. Since the questions were not mandatory, four questions were only answered by 5/6 participants and one question by 4/6 participants. All other questions were answered by six participants. It can be seen that there is still potential for improving user control and freedom (question 3) where the smallest mean values were achieved. Furthermore, help and guidance (question 6) shows potential for improvement. The other items achieved mean values of 2 and above. The sum of the mean values is 24 out of a maximum of 33 points.


Figure 3 . Heuristic evaluation. 1 = Visibility of system status, 2 = Match between system and real world, 3 = User control and freedom, 4 = Consistency and standards, 5 = Error prevention, 6 = Help and guidance, 7 = Flexibility and efficiency of use, 8 = Aesthetic, minimalist, and engaging design, 9 = Help users recognize, diagnose, and recover from errors, 10 = Context preservation, and 11 = Trustworthiness.

3.2.3 Usability questionnaire

The questionnaire for the Bot Usability Scale (BUS-11) was answered by 36 participants. Results are shown in Figure 4 . The BUS-11 questionnaire is provided as Supplementary material (see data availability statement). Perceived accessibility to the chatbot function was good (BUS11_SQ001 and SQ002); the system does not provide any functions other than the chatbot. Perceived quality of chatbot functions consists of three questions (BUS11_SQ003-5) that were to be judged by the participants. Eighty-six percent agreed with the statement that communication with the chatbot was clear. Seventy-two percent agreed that MIA was able to keep track of the context. Eighty-three percent confirmed that MIA's responses were easy to understand. Perceived quality of conversation and information provided (BUS11_SQ006-9) shows potential for improvement. Sixty-seven percent agreed that the chatbot understands what they want and helps them achieve their goal. Sixty-seven percent think the chatbot provides them with the appropriate amount of information. Sixty-one percent of participants agreed with the statement that the chatbot only gives the information needed. Fifty-six percent had the impression that the chatbot's answers were accurate. Perception of privacy and security was limited (BUS11_SQ010): only 47% agreed that they believe MIA informs them of any possible privacy issues. Response time (BUS11_SQ011) was short, as stated by 92% of the participants.


Figure 4 . Results from BUS-11 questionnaire (Bot usability scale), n = 36, BUS11_SQ001=“The chatbot function was easily detectable.” BUS11_SQ002=“It was easy to find the chatbot.” BUS11_SQ003=“Communicating with the chatbot was clear.” BUS11_SQ004=“The chatbot was able to keep track of context.” BUS11_SQ005=“The chatbot's responses were easy to understand.” BUS11_SQ006=“I find that the chatbot understands what I want and helps me achieve my goal.” BUS11_SQ007=“The chatbot gives me the appropriate amount of information.”, BUS11_SQ008=“The chatbot only gives me the information I need.” BUS11_SQ009=“I feel like the chatbot's responses were accurate.” BUS11_SQ010=“I believe the chatbot informs me of any possibly privacy issues.” BUS11_SQ011=“My waiting time for a response from the chatbot was short.”

The results of the additional questions on usability aspects are shown in Figure 5 . Font type and size were perceived well. Eighty-nine percent of participants agreed that the font was easy to read and the size was appropriate (Eval_SQ010 and SQ011); six percent disagreed with these statements. Sixty-four percent liked the background color while 8% disliked it (Eval_SQ009).


Figure 5 . Results from additional usability questions (n = 36), Eval_SQ001=“I have the impression that the digital assistant understands what I want to know.” Eval_SQ002=“I have the impression that the digital assistant helps me to get answers to my questions about the examination.” Eval_SQ003=“The digital assistant provides me with an appropriate amount of information.” Eval_SQ004=“I have the feeling that the digital assistant's answers are tailored to my needs.” Eval_SQ005=“The waiting time for a response from the digital assistant was in line with my expectations.” Eval_SQ006=“The digital assistant did not recognize how much I was bothered by some of the things discussed.” Eval_SQ007=“The digital assistant understood my words but not my feelings.” Eval_SQ008=“I think the digital assistant generally understood everything I said.” Eval_SQ009=“I like the background color.” Eval_SQ010=“The font was easy to read.” Eval_SQ011=“The font size was appropriate for me.”

Additional questions were asked related to comprehensibility and understanding. Sixty-four percent think that MIA generally understood them; 14% did not confirm this (Eval_SQ008). Forty-two percent had the impression that MIA understood their words, but not their feelings (Eval_SQ007); 22% disagreed with that statement. Further, 33% confirmed that MIA did not recognize how much they were bothered by some of the things discussed (Eval_SQ006); 39% disagreed with that statement. The waiting time for a response was within expectations for 89% of the participants, in contrast to 3% who disagreed (Eval_SQ005).

The answers were perceived as tailored to user needs by 50% of the participants; 17% disagreed (Eval_SQ004). Fifty-six percent of participants agreed that the appropriate amount of information was provided by MIA; 11% disagreed (Eval_SQ003). Additionally, 61% agreed that MIA helps them get answers about the examination, while 14% disagreed with that statement (Eval_SQ002). Fifty-eight percent of the participants had the impression that MIA understands what they want to know; 25% did not (Eval_SQ001). The results are shown in Figure 5 .

In the verbal feedback, one participant remarked that the amount of text was too large and that texts should rather be split into several chunks. All questions asked by MIA were mandatory, which was not well received. MIA was perceived as not empathetic since the system did not address comments such as “I do not feel well.” It was suggested that MIA could guide users better through the different topics, for example by providing some information on the topic of the upcoming questions (e.g., “Now, I will ask questions on wellbeing”). During the interview, MIA asks whether the patient's situation has improved since the last visit to a doctor. Regarding this question, one participant stated that this is difficult to judge when the visit was more than 2 years ago. Some questions were perceived as useless in the context of mammography.

3.2.4 Analysis of interaction with Q&A module

While 36 usability questionnaires were analyzed, only 35 interaction protocols could be analyzed, since one interaction was excluded. This participant was severely visually impaired, and the interaction with MIA was carried out by one of the facilitators, who read the questions asked by MIA and entered the answers given by the participant.

A total of 114 questions related to the topic of mammography were asked by the 36 participants to the Q&A module. The questions can be divided into two main topics: disease-related (16 questions) and examination-related (mammography, 92 questions). Questions related to the disease (breast cancer) can be grouped into gender-related questions, scientific perspectives related to the treatment (e.g., what will medicine for breast cancer look like in 10 years), mortality rate, inheritance of the disease, questions on tumors and age-related questions (see Figure 6 ). Questions related to the examination addressed its duration, frequency, age, costs, sensation of pain, results and aspects related to the procedure. Figure 7 shows the clustering of the queries related to the procedure of the examination. It can be seen that most questions referred to aspects during the examination and possible side effects.


Figure 6 . Clustering of the queries related to the disease asked by the 36 participants to the Q&A Module. Sixteen out of 114 questions dealt with the disease. Numbers in brackets refer to the number of questions belonging to this cluster.


Figure 7 . Clustering of the queries asked by the 36 participants to the Q&A Module related to the procedure of a mammography. Numbers in brackets refer to the number of questions belonging to this cluster ( n = 114).

Figure 8 shows the clustering of the queries related to the results collected by the examination. Participants asked about aspects that can be seen in the mammogram, the quality of the results for diagnosis, publication of the results and analysis related to the mammogram.


Figure 8 . Clustering of the queries asked by the 36 participants to the Q&A Module referring to the results of the mammography. Numbers in brackets refer to the number of questions belonging to this cluster ( n = 114).

The evaluation showed that MIA's Q&A module is not flexible in dialogue handling, mainly because of its matching algorithm based on question similarity. In case a patient question is not similar enough to a pre-defined one, a fallback mechanism is triggered and no answer is given. The question asked must contain more than 3 words for the module to provide an answer; MIA cannot handle shorter queries.

The Q&A module achieved a precision of 0.51, a recall of 0.87, an F-score of 0.64 and an accuracy of 0.54. This corresponds to 47/114 true positive answers (question part of the knowledge base and correctly answered), 46/114 false positives (question not part of the knowledge base, but answered by MIA), 7/114 false negative answers (question part of the knowledge base, but not answered), and 14/114 true negatives (question not part of the knowledge base and not answered).
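These figures follow directly from the reported confusion counts, as the short check below shows.

```python
# Re-deriving the reported Q&A metrics from the confusion counts
# (47 TP, 46 FP, 7 FN, 14 TN out of 114 questions).
tp, fp, fn, tn = 47, 46, 7, 14

precision = tp / (tp + fp)                                 # 47/93  ≈ 0.51
recall = tp / (tp + fn)                                    # 47/54  ≈ 0.87
f_score = 2 * precision * recall / (precision + recall)    # ≈ 0.64
accuracy = (tp + tn) / (tp + fp + fn + tn)                 # 61/114 ≈ 0.54

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f_score={f_score:.2f} accuracy={accuracy:.2f}")
```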

4 Discussion

4.1 Principal findings

We designed, developed, and evaluated a prototype of a medical interview assistant for radiology for the concrete use case of collecting information from patients before undergoing a mammography. Unlike other DMIAs, our system can render any definition of medical interview questions as long as they follow the defined FHIR profile. Furthermore, MIA allows patients to ask questions about the examination they are about to undergo. We conducted a comprehensive usability test with 36 participants; comparable studies often include only around 10 participants, as for example the study on a hypertension self-management chatbot described by Griffin et al. (2023) . A specific strength of MIA is its standardized format for exchanging medical interview questions and associated patient answers, based on HL7 FHIR. To the best of our knowledge, this is the first implementation of a medical interview assistant as a conversational agent that allows the collected information to be transferred in HL7 FHIR format and thus ensures seamless interoperability with clinical information systems.

The Q&A module provides a benefit for patients, who can ask their questions related to the examination. It remains for future work to assess whether this component also improves patients' satisfaction with the examination process, whether possible fears can be reduced and whether patients feel better informed. Another strength of the developed system is that the knowledge base of the Q&A module is designed to be improved over time: patient questions that cannot be answered by the system are stored. These unanswered questions can periodically be reviewed by physicians, who then add a corresponding answer to the system. Thus, the knowledge base is extended and adapted to real-world patient needs on an ongoing basis.

The rather low precision of 0.51 achieved in the usability test for the Q&A module might be due to the following two reasons: First, the initial knowledge base created for usability testing consists of only 33 question-answer pairs. Adding additional categories of questions as well as alternative formulations of existing question-answer pairs might increase the true positive rate and decrease the true negative rate. Second, the required similarity threshold for an answer to be matched is set to 0.7 (cosine similarity). By gradually increasing this threshold, the false positive rate might be reduced until the false negative rate increases.
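The threshold tuning suggested here could be prototyped with a simple sweep over labeled test questions; the data below is invented and pairs each test question's best-match cosine similarity with whether the question is actually covered by the knowledge base.

```python
# Sketch of a threshold sweep: count false positives (answered although not covered)
# and false negatives (not answered although covered) at each candidate threshold.
def sweep_thresholds(labeled_queries, thresholds=(0.6, 0.7, 0.8, 0.9)):
    for t in thresholds:
        fp = sum(1 for sim, in_kb in labeled_queries if sim >= t and not in_kb)
        fn = sum(1 for sim, in_kb in labeled_queries if sim < t and in_kb)
        print(f"threshold={t:.1f}  false positives={fp}  false negatives={fn}")


# Each tuple: (cosine similarity of the best match, question covered by the knowledge base?)
labeled = [(0.95, True), (0.72, True), (0.68, True), (0.75, False), (0.40, False)]
sweep_thresholds(labeled)
```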

Although recent developments including large language models are showing great potential in various healthcare settings ( Denecke et al., 2024 ), MIA was purposely designed as a rule-based agent to limit the shortcomings and pitfalls that conversational agents can lead to ( Denecke, 2023b ). While this design decision results in less flexibility in the conversation flow, the major advantage is the gained control over the system and the avoidance of misinformation: neither during the medical interview nor when answering questions will the system hallucinate, make up answers or provide clinically wrong responses. Therefore, there is no risk of misinformation, since the complete knowledge base was provided by clinical experts. However, we see a future application of large language models in the context of MIA, namely in improving accessibility: a large language model could be used to tailor the standardized and pre-defined content of MIA to the specific needs of the patient. This could include providing explanations or re-formulating statements in a language preferred by or more understandable to an individual user.

Based on a test with real patients in the setting in which the system is supposed to be used, we achieve a technology readiness level (TRL) of 6 for our implementation (with TRL 9 as the maximum, defined as an actual system proven in an operational environment). However, several aspects still have to be considered before applying MIA for information collection in a treatment setting. Using HL7 FHIR allows the collected information to be integrated into a clinical information system; however, we have not yet tested this. Moreover, HL7 FHIR has not been widely adopted yet. Accessibility still has to be improved: the system uses large amounts of text that might not be well understood by all patients. The readability assessment shows that the language is quite complex. One participant remarked that there is too much text to read. Voice input or output has not yet been realized, which means that visually impaired persons are excluded from use. Such a situation occurred during the usability test and demonstrated the need for ensuring accessibility. We plan to address these issues in future work by considering principles of inclusive design.

The evaluation showed that the aspects of perceived privacy and security have to be improved. This is essential for achieving user acceptance of such an interview assistant, given that the privacy of patient-doctor communication is protected by law. Also important for trustworthiness is the quality of the system as perceived by its users; in this regard, MIA still has potential for improvement. In particular, the Q&A module has to be improved to be able to answer more patient questions in an acceptable manner. Steerling et al. (2023) found in their study that individual characteristics, characteristics of the artificial intelligence-based solution and contextual characteristics influence users' trust. Nadarzynski et al. (2019) confirmed that transparency of the chatbot system regarding its quality and security is important for user engagement and trust. Interestingly, some questions asked by the medical interview module were perceived as redundant by the users in the context of mammography (e.g., a question on vaccinations). In the context of an actual patient-doctor encounter, a physician could explain why the information is relevant.

4.2 Reflections on the application of the evaluation framework

In this work, we used an existing framework proposed to evaluate conversational agents in healthcare ( Denecke, 2023a ). We conclude that the framework is useful to evaluate a chatbot from a technical perspective. The provided metrics and methods can be easily applied, and we saved time in designing the evaluation procedure by relying upon them. The framework also helped to reduce the time needed to develop the evaluation questionnaire. However, we noticed that some questions are quite similar, resulting in participants asking for clarification.

The framework is very comprehensive, and when applying it to our system we had to select the aspects that are of relevance. We had to drop 19 metrics since they were redundant (e.g., button color). However, in terms of comparability, we believe it is better to drop metrics than to add new ones and thereby lose comparability of evaluation results. In our case, application of the framework was slightly challenging, since the system consists of two modules that are realized differently and could even have been considered as individual conversational agents because of their different technical implementation and purpose. The interview module only asks questions, without any interpretation. The Q&A module does not initiate the conversation, but only replies to questions asked by a patient. However, we decided to consider both modules as one system, as they are perceived by the users as one system. For the usability test, we applied a set of standard tools (BUS-11, Borsci et al., 2022 , and heuristics, Langevin et al., 2021 ). The test sessions showed that some items of the BUS-11 scale were not relevant for our system, but to be comparable with other assessments that use BUS-11, we did not remove them. For example, one item is “It was easy to find the chatbot”; our system consists only of the chatbot, and there is no way to miss it since it starts as soon as the web page has been opened.

Applying the framework at an earlier stage of the development process could have helped in addressing the accessibility and security aspects right from the beginning. Nevertheless, the evaluation using the framework clearly showed what has to be done before the system can be used in the real world. The evaluation results will thus help us to improve the system in the next development iteration.

4.3 Limitations

Our evaluation has some limitations. The sample was limited to participants from two hospitals and some participants from two patient organizations. All participants had good German language skills; most of them were native German speakers. This might limit the generalizability of the results in terms of understandability. Furthermore, the analysis of readability using the LIX readability index has limitations: it considers the number of sentences when calculating the index, and we entered the complete set of queries asked by MIA into the calculation platform. It would have been better to determine the index for each statement and calculate the mean value of the indices. To address this issue, we applied three other readability scores. Altogether, they provide a clear picture of the readability of the text provided by MIA.
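For reference, the LIX index combines average sentence length with the share of long words. The sketch below shows the standard formula and the per-statement averaging suggested above; the tokenization rules are simplifying assumptions, not the exact behavior of the calculation platform used in the study.

```python
import re

def lix(text: str) -> float:
    """Standard LIX: words/sentences + 100 * long_words/words (long = more than 6 letters)."""
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    long_words = [w for w in words if len(w) > 6]
    return len(words) / max(len(sentences), 1) + 100 * len(long_words) / max(len(words), 1)

def mean_lix(statements) -> float:
    """Average LIX over individual chatbot statements instead of the concatenated text."""
    return sum(lix(s) for s in statements) / len(statements)

# Example with two invented German interview questions:
print(mean_lix([
    "Haben Sie bekannte Allergien?",
    "Wurden bei Ihnen bereits Operationen an der Brust durchgeführt?",
]))
```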

5 Conclusion

In this paper, we introduced a medical interview assistant with a conversational user interface for radiology. We provided an overview of our design and development process, which may serve as a set of best practices for the person-based development of such systems. In particular, we make the FHIR profiles available and recommend their application in similar systems to foster interoperability. We applied a comprehensive evaluation framework to study the quality of all relevant technical aspects of the system. The results can serve as a benchmark for future implementations of medical interview assistants with a conversational user interface. Given the increased interest in collecting patient-reported outcome measures as quality indicators, our work may pave the way for collecting such measures using a conversational agent. This might provide an improved user experience for some patient groups. In future work, the relevance of the collected information for the diagnostic process will be studied. We will improve the system based on the potential improvements identified in the evaluation and usability testing. A content maintenance process will be developed to allow for quick adaptation of the questionnaires to other examinations or even for the collection of patient-reported outcome measures.

Data availability statement

The original contributions presented in the study are included in the article/ Supplementary material , further inquiries can be directed to the corresponding author.

Ethics statement

Prior to participant recruitment, the study plan underwent review by the Regional Ethics Committee (Kantonale Ethikkommission, Kanton Bern) and was determined to be exempt from approval (BASEC-Nr: Req-2023-00982). The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants' legal guardians/next of kin in accordance with the national legislation and institutional requirements.

Author contributions

KD: Conceptualization, Methodology, Visualization, Writing – original draft, Writing – review & editing. DR: Investigation, Software, Visualization, Writing – original draft, Writing – review & editing. DW: Investigation, Writing – review & editing. KK: Investigation, Writing – review & editing. HB: Resources, Writing – review & editing. KN: Project administration, Writing – review & editing. NC: Resources, Writing – review & editing. DP: Software, Writing – review & editing. HvT-K: Resources, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was funded by the Swiss Innovation Agency, Innosuisse, grant number: 59228.1 IP-ICT.

Acknowledgments

We thank all members of the patient lobby of the University Hospital Bern as well as members of the Cancer League Switzerland for sharing their expertise and for participating voluntarily in the usability test. We extend our gratitude to the staff of the two collaborating hospitals for their exceptional efforts in coordinating the infrastructure and adjusting patient visit logistics, which facilitated the successful execution of our usability test. Specifically, we thank the team of radiology technologists as well as the front desk of the Radiology Department of the Lindenhof Group for their support. Additionally, we would like to thank Philipp Rösslhuemer for reviewing the knowledge base integrated into the MIA chatbot.

Conflict of interest

DP was employed by Mimacom AG at the time of chatbot development and paper writing; Mimacom AG financed the implementation of the system prototype.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that they were an editorial board member of Frontiers at the time of submission. This had no impact on the peer review process and the final decision.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2024.1431156/full#supplementary-material

Arora, S., Goldberg, A. D., and Menchine, M. (2014). Patient impression and satisfaction of a self-administered, automated medical history-taking device in the emergency department. West. J. Emerg. Med . 15:35. doi: 10.5811/westjem.2013.2.11498


Bell, S. K., Delbanco, T., Elmore, J. G., Fitzgerald, P. S., Fossa, A., Harcourt, K., et al. (2020). Frequency and types of patient-reported errors in electronic health record ambulatory care notes. J. Am. Med. Assoc. Netw. Open 3:e205867. doi: 10.1001/jamanetworkopen.2020.5867

Borsci, S., Malizia, A., Schmettow, M., Van Der Velde, F., Tariverdiyeva, G., Balaji, D., et al. (2022). The chatbot usability scale: the design and pilot of a usability scale for interaction with ai-based conversational agents. Person. Ubiquit. Comput . 26, 95–119. doi: 10.1007/s00779-021-01582-9


Cohen, D. J., Keller, S. R., Hayes, G. R., Dorr, D. A., Ash, J. S., and Sittig, D. F. (2016). Integrating patient-generated health data into clinical care settings or clinical decision-making: lessons learned from project healthdesign. JMIR Hum. Fact . 3:e5919. doi: 10.2196/humanfactors.5919

Dabbs, A. D. V., Myers, B. A., Mc Curry, K. R., Dunbar-Jacob, J., Hawkins, R. P., Begey, A., et al. (2009). User-centered design and interactive health technologies for patients. Comput. Informat. Nurs . 27, 175–183. doi: 10.1097/NCN.0b013e31819f7c7c

Denecke, K. (2023a). Framework for guiding the development of high-quality conversational agents in healthcare. Healthcare 11:1061. doi: 10.3390/healthcare11081061

Denecke, K. (2023b). Potential and pitfalls of conversational agents in health care. Nat. Rev. Dis. Prim . 9:66. doi: 10.1038/s41572-023-00482-x

Denecke, K., Cihoric, N., and Reichenpfader, D. (2023). “Designing a digital medical interview assistant for radiology,” in Studies in Health Technology and Informatics , eds. B. Pfeifer, G. Schreier, M. Baumgartner and D. Hayn (Amsterdam: IOS Press), 66.


Denecke, K., and May, R. (2023). Developing a technical-oriented taxonomy to define archetypes of conversational agents in health care: literature review and cluster analysis. J. Med. Internet Res . 25:e41583. doi: 10.2196/41583

Denecke, K., May, R., Group, L., and Rivera-Romero, O. (2024). Potentials of large language models in healthcare: a delphi study. J. Med. Internet Res . 2024:52399. doi: 10.2196/52399

Dhinagaran, D. A., Martinengo, L., Ho, M.-H. R., Joty, S., Kowatsch, T., Atun, R., et al. (2022). Designing, developing, evaluating, and implementing a smartphone-delivered, rule-based conversational agent (discover): development of a conceptual framework. JMIR mHealth uHealth 10:e38740. doi: 10.2196/38740

Doyle, C., Lennox, L., and Bell, D. (2013). A systematic review of evidence on the links between patient experience and clinical safety and effectiveness. Br. Med. J. Open 3:e001570. doi: 10.1136/bmjopen-2012-001570

Fabian, B., Ermakova, T., and Lentz, T. (2017). “Large-scale readability analysis of privacy policies,” in Proceedings of the International Conference on Web Intelligence, WI '17 (New York, NY: Association for Computing Machinery), 18–25.

Griffin, A. C., Khairat, S., Bailey, S. C., and Chung, A. E. (2023). A chatbot for hypertension self-management support: user-centered design, development, and usability testing. J. Am. Med. Assoc. Open 6:eooad073. doi: 10.1093/jamiaopen/ooad073

Han, Y., Moore, J. X., Colditz, G. A., and Toriola, A. T. (2022). Family history of breast cancer and mammographic breast density in premenopausal women. J. Am. Med. Assoc. Netw. Open 5:e2148983. doi: 10.1001/jamanetworkopen.2021.48983

HL7 International (2023). SDC Home Page—Structured Data Capture v3.0.0 . Available at: https://build.fhir.org/ig/HL7/sdc/ (accessed May 7, 2024).

Jones, S., Turton, P., and Achuthan, R. (2020). Impact of family history risk assessment on surgical decisions and imaging surveillance at breast cancer diagnosis. Ann. Royal Coll. Surg. Engl . 102, 590–593. doi: 10.1308/rcsann.2020.0103

Koney, N., Roudenko, A., Ro, M., Bahl, S., and Kagen, A. (2016). Patients want to meet with imaging experts. J. Am. Coll. Radiol . 13, 465–470. doi: 10.1016/j.jacr.2015.11.011

Langevin, R., Lordon, R. J., Avrahami, T., Cowan, B. R., Hirsch, T., and Hsieh, G. (2021). “Heuristic evaluation of conversational agents,” in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (New York, NY: Association for Computing Machinery), 1–15.

Laranjo, L., Dunn, A. G., Tong, H. L., Kocaballi, A. B., Chen, J., Bashir, R., et al. (2018). Conversational agents in healthcare: a systematic review. J. Am. Med. Informat. Assoc . 25, 1248–1258. doi: 10.1093/jamia/ocy072

Milne-Ives, M., de Cock, C., Lim, E., Shehadeh, M. H., de Pennington, N., Mole, G., et al. (2020). The effectiveness of artificial intelligence conversational agents in health care: systematic review. J. Med. Internet Res . 22:e20346. doi: 10.2196/20346

Nadarzynski, T., Miles, O., Cowie, A., and Ridge, D. (2019). Acceptability of artificial intelligence (AI)-led chatbot services in healthcare: a mixed-methods study. Digit. Health 5:2055207619871808. doi: 10.1177/2055207619871808

Nairz, K., Böhm, I., Barbieri, S., Fiechter, D., Hošek, N., and Heverhagen, J. (2018). Enhancing patient value efficiently: medical history interviews create patient satisfaction and contribute to an improved quality of radiologic examinations. PLoS ONE 13:e0203807. doi: 10.1371/journal.pone.0203807

Nielsen, J. (2006). Quantitative Studies: How Many Users to Test? Available at: https://www.nngroup.com/articles/quantitative-studies-how-many-users/ (accessed August 6, 2024).

Pappas, Y., Anandan, C., Liu, J., Car, J., Sheikh, A., and Majeed, A. (2011). Computer-assisted history-taking systems (CAHTS) in health care: benefits, risks and potential for further development. Inform Prim. Care . 19, 155–160. doi: 10.14236/jhi.v19i3.808

Pringle, M. (1988). Using computers to take patient histories. Br. Med. J . 297:697.

Reichenpfader, D., Rösslhuemer, P., and Denecke, K. (2024). Large language model-based evaluation of medical question answering systems: algorithm development and case study. Stud. Health Technol. Informat . 313, 22–27. doi: 10.3233/SHTI240006

Rockall, A. G., Justich, C., Helbich, T., and Vilgrain, V. (2022). Patient communication in radiology: moving up the agenda. Eur. J. Radiol . 155:110464. doi: 10.1016/j.ejrad.2022.110464

Rubin, D. L., and Kahn, C. E. Jr. (2017). Common data elements in radiology. Radiology 283, 837–844. doi: 10.1148/radiol.2016161553

Slack, W. V., Hicks, P., Reed, C. E., and Van Cura, L. J. (1966). A computer-based medical-history system. N. Engl. J. Med . 274, 194–198.

Spinazze, P., Aardoom, J., Chavannes, N., and Kasteleyn, M. (2021). The computer will see you now: overcoming barriers to adoption of computer-assisted history taking (CAHT) in primary care. J. Med. Internet Res . 23:e19306. doi: 10.2196/19306

Steerling, E., Siira, E., Nilsen, P., Svedberg, P., and Nygren, J. (2023). Implementing ai in healthcare—the relevance of trust: a scoping review. Front. Health Serv . 3:1211150. doi: 10.3389/frhs.2023.1211150

Taslakian, B., Sebaaly, M. G., and Al-Kutoubi, A. (2016). Patient evaluation and preparation in vascular and interventional radiology: what every interventional radiologist should know (part 1: patient assessment and laboratory tests). Cardiovasc. Intervent. Radiol . 39, 325–333. doi: 10.1007/s00270-015-1228-7

Tudor Car, L., Dhinagaran, D. A., Kyaw, B. M., Kowatsch, T., Joty, S., Theng, Y.-L., et al. (2020). Conversational agents in health care: scoping review and conceptual analysis. J. Med. Internet Res . 22:e17158. doi: 10.2196/17158

Wei, I., Pappas, Y., Car, J., Sheikh, A., and Majeed, A. (2011). Computer-assisted vs. oral-and-written dietary history taking for diabetes mellitus. Cochr. Datab. Systemat. Rev . 12:CD008488. doi: 10.1002/14651858.CD008488.pub2

Yardley, L., Morrison, L., Bradbury, K., and Muller, I. (2015). The person-based approach to intervention development: application to digital health-related behavior change interventions. J. Med. Internet Res . 17:e4055. doi: 10.2196/jmir.4055

Keywords: medical history taking, conversational agent, consumer health information, algorithms, patients, radiology, user-centered design, natural language processing

Citation: Denecke K, Reichenpfader D, Willi D, Kennel K, Bonel H, Nairz K, Cihoric N, Papaux D and von Tengg-Kobligk H (2024) Person-based design and evaluation of MIA, a digital medical interview assistant for radiology. Front. Artif. Intell. 7:1431156. doi: 10.3389/frai.2024.1431156

Received: 11 May 2024; Accepted: 22 July 2024; Published: 16 August 2024.


Copyright © 2024 Denecke, Reichenpfader, Willi, Kennel, Bonel, Nairz, Cihoric, Papaux and von Tengg-Kobligk. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Kerstin Denecke, kerstin.denecke@bfh.ch



Community resilience evaluation and construction strategies in the perspective of public health emergencies: a case study of six communities in Nanjing


1. Introduction

2. Literature Review

2.1. Resilience Theory

2.2. Resilience Assessment

3. Materials and Methods

3.1. Evaluation System

3.2. Weights

4. Samples and Results

4.1. Study Area

  • Suojin Village Community (I) was established in the 1980s; it covers an area of 0.45 square kilometers and has a population of 12,000. It is located in the central part of Xuanwu District, bordered by Xuanwu Lake Street to the east, Taipingmen and Houzaimen Street to the south, Xuanwu Lake and Xuanwumen Street to the west, and the Shanghai–Nanjing Railway and Hongshan Street to the north.
  • Bancang Community (II) was established in the 1980s, covering an area of 0.28 square kilometers with a population of 6800. It is bordered by the Purple Mountain to the south, Jiangwangmiao Community to the east, Xuanwu Lake to the west, and is adjacent to Suojincun Street.
  • Zixincheng Community (III) was established in the 1980s, covering an area of 0.35 square kilometers with a population of 6000. It is bordered by Purple Mountain to the east, Xuanwu Lake to the west, Baima Park and Bei’anmen Street to the south, and Ningxi Road to the north.
  • Jiangwangmiao Community (IV) was established in the 1990s, covering an area of 0.32 square kilometers with a population of 6000. It is bordered by Ningxi Road to the south, National Highway 312 to the west, and Huaxin West Road to the north.
  • Huayuan Road Community (V) was established in the 1990s, covering an area of 0.47 square kilometers with a population of 13,700. Garden Road runs through the community, which is bordered by Huaxin West Road to the east, Nanjing Forestry University to the west, Garden Road Neighborhoods 5 and 8 to the south, and Xuanwu Avenue to the north.
  • Yingtie Village Community (VI) was established in the 1990s, covering an area of 0.57 square kilometers with a population of 13,000. It is bordered by the Jingwu Overpass to the east, Yingtuo Huayuan Road Community to the south, the East Long-Distance Bus Station to the west, and Xuanwu Avenue to the north.

4.2. Resilience Evaluation Results

4.2.1. First-Level Indicator

4.2.2. Second-Level Indicator

4.2.3. Third-Level Indicator

4.3. Validity and Reliability of the Empirical Results

  • Suojin Village Community (I) established an efficient information communication mechanism to ensure that residents were kept up to date with the latest pandemic developments. Among the interviewed residents, there was widespread satisfaction with the community's pandemic prevention and control performance. However, the community performs poorly on the pedestrian and bicycle lane indicator (B1). It is therefore advised that, in the subsequent resilience enhancement construction, the walking and cycling systems be improved, the public transit system be made more accessible, the transportation network and transfer facilities be laid out sensibly, and the environment for slow traffic be improved. Furthermore, the design and optimization of the community's slow-traffic road system should be combined with adaptable motor vehicle management plans, such as restricting parking during public health emergencies to lower safety risks.
  • Bancang Community (II) performs well in public service facilities (B9) and recovery adaptability (B14). Residents expressed universal satisfaction with the community's pandemic prevention and control performance: during the pandemic, the town plaza was rapidly converted into an emergency center, significantly slowing the spread of the virus. However, the community performs poorly on the open space indicator (B4). It is therefore advised that, in the subsequent resilience enhancement construction, open green spaces be given more uses and that outdoor activity areas be planned with consideration for the local climate and the needs of residents. In addition, comfortable slow traffic can be ensured by improving the accessibility and connectivity of public areas such as parks and block green spaces. At the same time, the spatial layout should be adjusted to residents' needs, for example by adding rest areas and lighting in the park, to provide a better open area for everyday community activities.
  • Zixincheng Community (III) performs poorly in emergency support facilities (B10). The majority of the citizens were dissatisfied with the community’s pandemic prevention and control performance. It is recommended to improve the configuration of medical equipment to ensure meeting various medical needs. Balanced layout and adding facilities to fill gaps and expand service coverage are essential. Additionally, establishing a 15-min disaster prevention and epidemic prevention zone and increasing facilities such as health stations can enhance epidemic prevention capabilities.
  • Jiangwangmiao Community (IV) performs poorly in transportation space (B3). It is recommended to optimize the punctuality of public transportation and integrate non-motorized transportation, increase the density of bus stops, and reduce waiting times. Additionally, it is crucial to strategically allocate public transportation, medical facilities, and open spaces, establish a network of slow traffic and life services covering the community, and promote the development of a healthy community.
  • Huayuan Road Community (V) performs poorly in emergency defense space (B7). It is recommended to optimize emergency shelters to respond to public health emergencies. It is suggested to establish construction standards that match the community, renovate public buildings to meet disaster response needs, and consider public and commercial facilities as potential shelters. Establishing and updating relevant databases for the rapid conversion of space use is also recommended.
  • Yingtie Village Community (VI) performs poorly in supply storage space (B6). It is recommended to improve community emergency material reserves by establishing dedicated storage facilities. Implementing efficient material storage and rotation systems, and integrating community resources to optimize emergency provisioning, are crucial steps. Ensuring the seamless supply and utilization of materials in both emergency and normal situations, covering all residents and organizations, will enhance emergency response capabilities.

5. Discussion

5.1. Optimization Strategy from a Full Cycle Perspective

5.1.1. Optimization Strategy for Preparation and Prevention Phase

5.1.2. Optimization Strategy for Impact and Response Phase

5.1.3. Optimization Strategy for Recovery and Adaptation Phase

5.2. Practical Application in Real-World Circumstances

5.3. Limitations of the Study

6. Conclusions

Author Contributions

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Source | Website Link
New and used community data | Weibo (accessed on 8 August 2022); WeChat (accessed on 8 August 2022)
Evaluation questionnaire data | Questionnaire network (accessed on 5 February 2024)
Software and algorithms | Analytic Hierarchy Process (accessed on 23 March 2024); Python (accessed on 8 August 2022); Auto CAD (accessed on 13 May 2022); Excel (accessed on 23 February 2024); ArcGIS (accessed on 23 March 2024); Space syntax (accessed on 23 March 2024)


Phases: P1 = Preparedness and Prevention Phase; P2 = Impact and Response Phase; P3 = Recovery and Adaptation Phase.

Built environment (A1)
  (A1,P1) To maintain a good community environment and enhance the friendliness of public spaces, encouraging residents to participate in outdoor activities.
  (A1,P2) Restricting external traffic flow at community entrances and equipping open public spaces with enhanced epidemic prevention functions to ensure residents' physical and mental well-being.
  (A1,P3) Establishing community parks, pocket green spaces, and other public recreational areas, and utilizing linear greenery as a natural barrier to reduce health risks.

Emergency spaces (A2)
  (A2,P1) Planning adequate isolation spaces and layout of refuge areas, ensuring sufficient evacuation areas.
  (A2,P2) Always ensure the security of emergency spaces and strive for unobstructed emergency routes.
  (A2,P3) Expanding the number of emergency spaces, repair damaged areas, and meeting the dual requirements of emergency and daily use.

Critical facilities (A3)
  (A3,P1) Increase the redundancy of community facilities and cultivate residents' awareness of using safety facilities.
  (A3,P2) Fully utilize community hospitals, sports facilities, leisure and health centers, and other health facilities for emergency interventions to minimize residents' health injuries.
  (A3,P3) Accelerate the restoration of postal, express delivery, and other transportation facilities to meet the dynamic needs of integrating community services during and after pandemics.

Organizational behavior (A4)
  (A4,P1) Conduct early warning and prevention information campaigns; perform safety hazard inspections.
  (A4,P2) Initiate emergency rescue and evacuation operations; formulate disaster response plans.
  (A4,P3) Announce the disaster situation and ongoing efforts; promote community spirit of mutual assistance; enhance the level of health activities for residents.
First-Level Indicator (A) | Phase | Second-Level Indicator (B) | Third-Level Indicator (C) | Measurement Method
Resilience of built environment (A1) | P1 | Pedestrian and bicycle lane (B1) | C1 Street visual comfort | Semantic segmentation
Resilience of built environment (A1) | P1 | Pedestrian and bicycle lane (B1) | C2 Perception of street scale | Street height-to-width ratio
Resilience of built environment (A1) | P1 | Land use (B2) | C3 Land development intensity | Building density formula
Resilience of built environment (A1) | P1 | Land use (B2) | C4 Land use diversity | Land use formula
Resilience of built environment (A1) | P2 | Transportation space (B3) | C5 Road integration | SpaceSyntax
Resilience of built environment (A1) | P2 | Transportation space (B3) | C6 Road connectivity | Ratio of intersections to sidewalks
Resilience of built environment (A1) | P3 | Open space (B4) | C7 Spatial coverage | Ratio of open space area to total community area
Resilience of built environment (A1) | P3 | Open space (B4) | C8 Morphological compactness | Compactness index formula
Resilience of emergency space (A2) | P1 | Emergency shelter signage system (B5) | C9 Signage utility | Questionnaire
Resilience of emergency space (A2) | P1 | Emergency shelter signage system (B5) | C10 Layout rationality | Field research
Resilience of emergency space (A2) | P1 | Supply storage space (B6) | C11 Spatial coverage | Service coverage of supply points
Resilience of emergency space (A2) | P1 | Supply storage space (B6) | C12 Material supply level | Two-step floating catchment area method
Resilience of emergency space (A2) | P2 | Emergency defense space (B7) | C13 Accessibility of places | Shortest distance from shelter to hospital
Resilience of emergency space (A2) | P2 | Emergency defense space (B7) | C14 Coverage of places | Shelter service area
Resilience of emergency space (A2) | P2 | Emergency defense space (B7) | C15 Safety of emergency access | Road congestion
Resilience of emergency space (A2) | P3 | Post-pandemic integration area (B8) | C16 Operability of post-pandemic transition | Percentage of operable space units
Resilience of emergency space (A2) | P3 | Post-pandemic integration area (B8) | C17 Scale of spatial planning for post-pandemic transition | Area of the epidemic prevention space
Resilience of critical facilities (A3) | P1 | Public service facilities (B9) | C18 Facility equity | Location entropy index
Resilience of critical facilities (A3) | P1 | Public service facilities (B9) | C19 Facility coverage | Public facility service coverage
Resilience of critical facilities (A3) | P2 | Emergency support facilities (B10) | C20 Provision of healthcare facilities | Two-step floating catchment area method
Resilience of critical facilities (A3) | P2 | Emergency support facilities (B10) | C21 Accessibility of healthcare facilities | Two-step floating catchment area method
Resilience of critical facilities (A3) | P3 | Post-pandemic integration facilities (B11) | C22 Number of available existing facilities | Field research
Resilience of critical facilities (A3) | P3 | Post-pandemic integration facilities (B11) | C23 Facility maintenance | Field research
Resilience of organizational behavior (A4) | P1 | Preventive baseline conditions (B12) | C24 Residents' disaster awareness | Questionnaire
Resilience of organizational behavior (A4) | P1 | Preventive baseline conditions (B12) | C25 Community disaster preparedness level | Questionnaire
Resilience of organizational behavior (A4) | P2 | Emergency preparedness level (B13) | C26 Level of resident activity | Standard deviational ellipse
Resilience of organizational behavior (A4) | P2 | Emergency preparedness level (B13) | C27 Community organizational capacity | Python
Resilience of organizational behavior (A4) | P3 | Recovery adaptability (B14) | C28 Healthiness of activities | Questionnaire
Resilience of organizational behavior (A4) | P3 | Recovery adaptability (B14) | C29 Restoration participation | Python
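Several third-level indicators above (C12, C20, C21) are measured with the two-step floating catchment area (2SFCA) method. As a rough illustration of how such an accessibility score can be computed, the following sketch assumes simple distance matrices and a fixed catchment radius; it is not the exact GIS workflow used in the study.

```python
import numpy as np

def two_step_fca(supply, demand, dist, d0):
    """Basic 2SFCA accessibility score.

    supply: (m,) facility capacities (e.g., hospital beds)
    demand: (n,) population at each demand location
    dist:   (n, m) travel distance or time between demand points and facilities
    d0:     catchment radius
    """
    within = dist <= d0  # which demand-facility pairs fall inside the catchment
    # Step 1: supply-to-demand ratio of each facility
    demand_within = within.T.astype(float) @ demand  # population reachable by each facility
    ratio = np.divide(supply, demand_within,
                      out=np.zeros_like(supply, dtype=float),
                      where=demand_within > 0)
    # Step 2: sum the ratios of all facilities reachable from each demand point
    return within.astype(float) @ ratio

# Toy example: 3 residential clusters, 2 healthcare facilities, 15-minute catchment
access = two_step_fca(
    supply=np.array([50.0, 120.0]),
    demand=np.array([1200.0, 800.0, 600.0]),
    dist=np.array([[5.0, 20.0], [10.0, 12.0], [25.0, 8.0]]),
    d0=15.0,
)
print(access)
```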
Indicator weights (first-level weight; second-level combined weight; third-level combined weight with ranking):

Resilience of built environment (A1), weight 0.3444
  B1 (0.1586): C1 0.0529, rank 9; C2 0.1057, rank 1
  B2 (0.0257): C3 0.0086, rank 26; C4 0.0171, rank 18
  B3 (0.0470): C5 0.0117, rank 23; C6 0.0352, rank 12
  B4 (0.1131): C7 0.0754, rank 3; C8 0.0377, rank 11
Resilience of emergency space (A2), weight 0.2111
  B5 (0.0205): C9 0.0068, rank 28; C10 0.0137, rank 22
  B6 (0.0220): C11 0.0055, rank 29; C12 0.0165, rank 19
  B7 (0.0505): C13 0.0202, rank 15; C14 0.0101, rank 25; C15 0.0202, rank 15
  B8 (0.1181): C16 0.0591, rank 6; C17 0.0591, rank 6
Resilience of critical facilities (A3), weight 0.2472
  B9 (0.0489): C18 0.0326, rank 13; C19 0.0163, rank 20
  B10 (0.1212): C20 0.0606, rank 4; C21 0.0606, rank 4
  B11 (0.0771): C22 0.0578, rank 8; C23 0.0193, rank 17
Resilience of organizational behavior (A4), weight 0.1972
  B12 (0.0415): C24 0.0103, rank 24; C25 0.0311, rank 14
  B13 (0.0475): C26 0.0079, rank 27; C27 0.0400, rank 10
  B14 (0.1082): C28 0.0927, rank 2; C29 0.0155, rank 21
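The Analytic Hierarchy Process listed among the data sources is a common way to derive such weights: local weights come from pairwise comparison matrices, and a combined weight is a local weight multiplied by its parent's weight, which is why the combined weights of B1 to B4 sum to A1's 0.3444. The study's actual comparison matrices are not reproduced here, so the sketch below only illustrates the eigenvector method and consistency check typically used with AHP, with an invented 4x4 matrix for the four dimensions.

```python
import numpy as np

# Random consistency index for matrices of size 1..9 (Saaty's standard values)
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(pairwise):
    """Principal-eigenvector weights and consistency ratio for a pairwise comparison matrix."""
    A = np.asarray(pairwise, dtype=float)
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()
    ci = (eigvals[k].real - n) / (n - 1)          # consistency index
    cr = ci / RI[n] if RI.get(n, 0) > 0 else 0.0  # consistency ratio, ideally < 0.1
    return w, cr

# Invented pairwise judgments for dimensions A1..A4 (not the study's actual matrix)
A = [[1,   2, 1, 2],
     [1/2, 1, 1, 1],
     [1,   1, 1, 1],
     [1/2, 1, 1, 1]]
weights, cr = ahp_weights(A)
print(weights, cr)

# A combined third-level weight would then be, e.g.:
# w(C2) = w(A1) * w(B1 within A1) * w(C2 within B1)
```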
Sample Communities | I | II | III | IV | V | VI
Comprehensive resilience evaluation | 3.1158 | 3.502 | 2.2415 | 2.5979 | 3.3753 | 2.5028
Ranking | 3 | 1 | 6 | 4 | 2 | 5
Analysis item: Resilience evaluation (Brown F = 2.340, p = 0.046)
Name | Sample Size | Average Value | Standard Deviation
Community I | 29 | 0.03 | 0.05
Community II | 29 | 0.03 | 0.04
Community III | 29 | 0.08 | 0.14
Community IV | 29 | 0.08 | 0.11
Community V | 29 | 0.04 | 0.06
Community VI | 29 | 0.05 | 0.07
Total | 174 | 0.05 | 0.09
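The Brown-Forsythe statistic reported above compares the six communities while allowing for unequal variances. The per-respondent scores are not published with the table, so the sketch below uses simulated groups only to show how one common variant, the Brown-Forsythe test for equality of variances, can be run; it is available in scipy as Levene's test with the median as center. The study may have used the Brown-Forsythe robust test of means instead, so this is an assumption for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-respondent scores for six communities (29 respondents each),
# generated from the reported means and standard deviations purely for illustration.
reported = [(0.03, 0.05), (0.03, 0.04), (0.08, 0.14),
            (0.08, 0.11), (0.04, 0.06), (0.05, 0.07)]
groups = [rng.normal(loc=m, scale=s, size=29) for m, s in reported]

# center='median' gives the Brown-Forsythe variant of Levene's test
stat, p = stats.levene(*groups, center="median")
print(f"Brown-Forsythe F = {stat:.3f}, p = {p:.3f}")
```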
Dimension | I | II | III | IV | V | VI
Built environment | 0.3501 | 0.5287 | 0.3172 | 0.6872 | 0.6344 | 0.5287
Emergency spaces | 0.1522 | 0.1165 | 0.0892 | 0.1263 | 0.1618 | 0.063
Critical facilities | 0.2445 | 0.2445 | 0.1793 | 0.1141 | 0.163 | 0.0815
Organizational behavior | 0.0931 | 0.1345 | 0.0517 | 0.0517 | 0.1034 | 0.0828
Total | 0.8399 | 1.0242 | 0.6374 | 0.9793 | 1.0626 | 0.756

Dimension | I | II | III | IV | V | VI
Built environment | 0.1759 | 0.1524 | 0.1055 | 0.0586 | 0.1407 | 0.1876
Emergency spaces | 0.202 | 0.1919 | 0.0707 | 0.1212 | 0.1515 | 0.1212
Critical facilities | 0.4848 | 0.4242 | 0.1818 | 0.5454 | 0.5454 | 0.1818
Organizational behavior | 0.1116 | 0.1437 | 0.1037 | 0.0716 | 0.1516 | 0.0795
Total | 0.9743 | 0.9122 | 0.4617 | 0.7968 | 0.9892 | 0.5701

Dimension | I | II | III | IV | V | VI
Built environment | 0.2262 | 0.2262 | 0.2262 | 0.2262 | 0.3393 | 0.4524
Emergency spaces | 0.4728 | 0.4728 | 0.2955 | 0.1773 | 0.4137 | 0.2364
Critical facilities | 0.1543 | 0.2313 | 0.212 | 0.1349 | 0.1928 | 0.1928
Organizational behavior | 0.4483 | 0.541 | 0.3401 | 0.2319 | 0.3091 | 0.2009
Total | 1.3016 | 1.4713 | 1.0738 | 0.7703 | 1.2549 | 1.0825
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Zhang, F.; Wang, D.; Zhou, X.; Ye, F. Community Resilience Evaluation and Construction Strategies in the Perspective of Public Health Emergencies: A Case Study of Six Communities in Nanjing. Sustainability 2024, 16, 6992. https://doi.org/10.3390/su16166992


  • Open access
  • Published: 15 August 2024

Validity of evaluation scales for post-stroke depression: a systematic review and meta-analysis

  • Fang Liu 1 ,
  • Lei Gong 2 ,
  • Huan Zhao 1 ,
  • Ying-li Li 1 ,
  • Zhiwen Yan 1 &
  • Jun Mu   ORCID: orcid.org/0000-0002-1866-744X 1  

BMC Neurology volume 24, Article number: 286 (2024)


Post-stroke depression (PSD) is closely associated with poor stroke prognosis. However, there are some challenges in identifying and assessing PSD. This study aimed to identify scales for PSD diagnosis, assessment, and follow-up that are straightforward, accurate, efficient, and reproducible.

A systematic literature search was conducted in 7 electronic databases from January 1985 to December 2023.

Thirty-two studies were included. The Patient Health Questionnaire-9 (PHQ-9) and the Hamilton Depression Rating Scale (HDRS) had the highest diagnostic accuracy for PSD. The sensitivity, specificity, and diagnostic odds ratio of the PHQ-9 for diagnosing any depression were 0.82, 0.87, and 29, respectively; for the HDRS, used for diagnosing major depression, they were 0.92, 0.89, and 94. Furthermore, these two scales also had higher diagnostic accuracy when assessing depressive symptoms during both the acute and chronic phases of stroke. For patients with post-stroke aphasia or cognitive impairment, no scale with high diagnostic accuracy has yet been identified.

Conclusions

The PHQ-9 and HDRS scales are recommended to assess PSD. HDRS, which demonstrates high diagnostic performance, can replace structured interviews based on diagnostic criteria.


Introduction

Stroke is a significant cardiovascular disease, and its incidence rate and associated disease risks are of global concern [ 1 ]. With the increasing incidence of stroke worldwide, the number of people suffering from post-stroke depression (PSD) has increased significantly [ 2 ]. PSD is one of the most common complications after stroke. The main manifestations are depressive mood and loss of interest, often accompanied by somatic symptoms such as weight loss, insomnia, and fatigue [ 3 , 4 ]. PSD seriously hinders the recovery of neurological function in stroke patients, leading to prolonged hospital stays, loss of social interaction and independent living skills, and even increased stroke recurrence and mortality [ 5 , 6 ]. Therefore, early diagnosis and treatment of PSD are crucial for prognosis. Currently, the diagnosis of PSD is still based on structured interviews [ 7 ]. Since the pathogenesis of PSD is not entirely clear [ 8 ], the dual effects of stroke-induced brain damage and mental stress complicate its diagnosis. Presently, PSD is classified as a mental disorder rather than a neurological disorder. For example, in the Diagnostic and Statistical Manual of Mental Disorders, 5th Edition (DSM-V), PSD is categorized under depressive disorder due to other physical diseases [ 7 ]; in the 10th edition of the International Classification of Mental Disorders (ICD-10), it is classified as an organic mental disorder [ 9 ]; similarly, in the Chinese Classification and Diagnostic Standard of Mental Disorders (CCMD-3), it is regarded as a mental disorder caused by cerebrovascular diseases [ 10 ]. The diverse diagnostic criteria across different classification systems further complicate the diagnosis of PSD. Additionally, most of the scales used to assess PSD refer to scales developed for Major Depressive Disorder (MDD) [ 4 , 11 ].

There are mainly three types of depression scales. Firstly, self-rating scales, such as the Patient Health Questionnaire-9 (PHQ-9), the Beck Depression Inventory (BDI), and the Self-Rating Depression Scale (SDS). Secondly, clinician-rated scales, including the Hamilton Depression Rating Scale (HDRS) and the Montgomery Asberg Depression Rating Scale (MADRS). Thirdly, depression assessment scales for specific populations, such as the Geriatric Depression Scale (GDS) and the Stroke Aphasic Depression Questionnaire (SADQ-10). Due to the lack of uniform standards, clinical studies may apply different scales to assess the same PSD populations or use a single scale to assess PSD populations with different characteristics. The validity of these scales varies widely, leading to differences in the epidemiology, diagnosis, and assessment of PSD. Although some research teams have developed PSD-specific scales, such as the Post-Stroke Depression Symptom Inventory (PSDS) [ 12 ] and the Post-Stroke Depression Prediction Scale (DePreS) [ 13 ], their validity is still under clinical evaluation and they are not widely used.

Therefore, it is urgent to identify scales that can simplify the diagnostic process of PSD and facilitate prognosis evaluation. This meta-analysis aimed to identify accurate, simple, and reproducible assessment scales for PSD.

Literature search

Through computer retrieval, seven English electronic databases (PubMed, EMBASE, Medline, Web of Science, ClinicalTrials.gov, CINAHL, and the Cochrane Library) were searched for literature on PSD and scale assessment published from January 1985 to December 2023. The search scope included title and abstract, and the language was limited to English. Based on the Medical Subject Headings (MeSH), the search keywords included:

Post-stroke depression: ‘post-stroke depression’ or ‘post stroke depression’ or ‘PSD’ or ‘depression after stroke’ or ‘emotional disturbances after stroke’ or ‘emotionalism after stroke’ or ‘vascular depression’ or ‘post stroke depressive disorder’ or ‘depressive disorder after stroke’.

Assessment: ‘assessment scale’ or ‘validity’ or ‘measure’ or ‘measures’ or ‘evaluation’.

The retrieval formula was (#1 and #2) not (‘Meta-Analysis’ or ‘Review’ or ‘Systematic Review’).

Inclusion and exclusion criteria

Inclusion criteria were as follows:

The studies were original studies, including case-control and cohort studies with a clearly defined period of development or publication.

The study content involved the use of depression scales to evaluate PSD

Participants met the diagnostic criteria for stroke

The evaluation of PSD adhered to the relevant classification and diagnostic criteria (DSM, ICD, CCMD)

The study needed to provide the number of patients with stroke and PSD.

Exclusion criteria were:

Animal studies related to PSD

Lack of clear criteria for the diagnosis of stroke

Failure to use the diagnostic criteria for PSD based on structured interviews or assessments

Researchers did not adopt scientific data collection methods

Inappropriate use of statistical methods in research or errors in data analysis

Reviews, systematic reviews, dissertations, conference papers, and repeated publications

The literature was not in English.

Study selection

We included, but were not limited to, the following types of scales: ‘The Patient Health Questionnaire-2 (PHQ-2)’, ‘The Patient Health Questionnaire-9 (PHQ-9)’, ‘Center for Epidemiological Studies-Depression (CES-D)’, ‘Montgomery Asberg Depression Rating Scale (MADRS)’, ‘Beck Depression Inventory (BDI)’, ‘Hamilton Depression Rating Scale (HDRS or HAMD)’, ‘Hospital Anxiety and Depression Scale (HADS)’, ‘Self-Rating Depression Scale (SDS)’, ‘The Geriatric Depression Scale (GDS)’, ‘Post-Stroke Depression Scale (PSDS)’, ‘Post-Stroke Depression Rating Scale (PSDRS)’, ‘Visual Analog Mood Scale (VAMS)’, and ‘Stroke Aphasic Depression Questionnaire Hospital Version (SADQ-H)’.

Data extraction

Firstly, the studies retrieved from the databases were imported into EndNote X9.3.2 (Thomson Scientific, USA). After removing duplicates, the titles and abstracts of the remaining studies were screened. Secondly, the included studies were identified after reading the full text of each study according to the inclusion and exclusion criteria. The extracted data mainly included: author, publication time, number of cases, assessment scales and cut-offs, PSD diagnostic criteria, type of stroke, the time since stroke onset when depressive symptoms were evaluated, and type of depression.

Quality evaluation

Two reviewers independently assessed the quality and risk of bias of all included studies using the Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool [ 14 ]. Any disagreements between the reviewers were discussed with a senior expert until a consensus was reached.

Data analysis

The RevMan 5.4 statistical software provided by the Cochrane Collaboration was used for quality assessment of the data and statistical description. We used Stata 15.1 software for the meta-analysis and heterogeneity tests. In cases where the heterogeneity between studies was P > 0.1 and I² < 50%, we employed a fixed-effect model for the pooled analysis. Conversely, if the heterogeneity between studies was P ≤ 0.1 and I² ≥ 50%, a random-effects model was used. We utilized the bivariate mixed-effects model to assess the diagnostic efficacy of each scale, focusing on the key evaluation indicators [ 15 ]: sensitivity, specificity, positive likelihood ratio, negative likelihood ratio, and diagnostic odds ratio. Scales included in the evaluation had to meet the criteria of the bivariate mixed-effects model analysis, with a minimum sample size of 3 studies (n ≥ 3).
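To make these evaluation indicators concrete, the sketch below computes per-study sensitivity, specificity, likelihood ratios, and the diagnostic odds ratio from 2x2 counts, and uses Cochran's Q and I² on the log diagnostic odds ratios to decide between a fixed-effect and a random-effects summary. It is a simplified stand-in for the bivariate mixed-effects model fitted in Stata, with invented counts.

```python
import numpy as np

def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, LR+, LR-, and diagnostic odds ratio for one study."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "LR+": sens / (1 - spec),
        "LR-": (1 - sens) / spec,
        "DOR": (tp * tn) / (fp * fn),
    }

def pooled_log_dor(studies):
    """Inverse-variance pooling of log DOR with an I-squared heterogeneity check."""
    y, w = [], []
    for tp, fp, fn, tn in studies:
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))  # continuity correction
        y.append(np.log(tp * tn / (fp * fn)))
        w.append(1.0 / (1 / tp + 1 / fp + 1 / fn + 1 / tn))
    y, w = np.array(y), np.array(w)
    pooled = np.sum(w * y) / np.sum(w)        # fixed-effect estimate
    q = np.sum(w * (y - pooled) ** 2)         # Cochran's Q
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    model = "fixed-effect" if i2 < 50 else "random-effects"
    return np.exp(pooled), i2, model

# Invented 2x2 counts (TP, FP, FN, TN) from three hypothetical PHQ-9 studies
studies = [(40, 12, 8, 110), (25, 9, 6, 80), (55, 20, 10, 140)]
print(diagnostic_metrics(*studies[0]))
print(pooled_log_dor(studies))
```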

The subgroup analyses covered three groupings: (1) Depression type, divided into an any-depression group and a major-depression group. Major depression was defined according to the diagnosis of MDD in DSM-V [ 7 ]: patients were required to have five or more of nine depressive symptoms lasting more than two weeks after the stroke event, at least one of which was depressed mood or loss of interest or pleasure. The definition of any depression was broader, following the depressive disorder definition in DSM-III [ 16 ], encompassing adjustment disorder with depressive mood, disorder, and dysthymia. (2) Stroke staging, divided into the acute phase after stroke (≤ 2 months) and the chronic phase after stroke (> 2 months). (3) Specific populations, which include patients with certain characteristics, such as a comorbid history of pre-stroke depression, stroke with aphasia, cognitive dysfunction, and other features.

This study followed the PRISMA reporting guidelines [ 17 ]. The screening flowchart is shown in Fig. 1. Thirty-two studies [ 12 , 13 , 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 ] involving 3865 people aged between 18 and 92 were included. The relevant information from the studies is presented in Table 1. The ROBINS-I tool was used to evaluate the quality of the included literature. The evaluation results are presented in Fig. 2 and Fig. 3.

Figure 1. The flow chart of literature screening

Figure 2. Risk of bias and fitness bar chart

Figure 3. Summary plot of risk of bias and fitness items

Meta-analysis of scale selection

Sensitivity and specificity of the scales were assessed when the number of articles involving each scale was two or more (n ≥ 2). The study assessed ten scales (PHQ-9, HDRS, MADRS, BDI, GDS, HADS-D, PHQ-2, CES-D, HADS, and PSDS) across 28 articles. These ten scales had different sensitivities and specificities, and the same scale had different sensitivities and specificities in different studies (Fig. 4).

Figure 4. Forest plot of sensitivity and specificity for each scale. PHQ-9: Patient Health Questionnaire-9. HDRS: Hamilton Depression Rating Scale. MADRS: Montgomery Asberg Depression Rating Scale. BDI: Beck Depression Inventory. GDS: Geriatric Depression Scale. HADS-D: Hospital Anxiety and Depression Scale-Depression. PHQ-2: The Patient Health Questionnaire-2. CES-D: Center for Epidemiological Studies-Depression. HADS: Hospital Anxiety and Depression Scale. PSDS: Post-Stroke Depression Symptom Inventory

Subgroup analysis

Depression type

Any depression

Five scales were used to assess PSD when depression was classified as any depression. Overall, the PHQ-9 had high diagnostic efficacy when both sensitivity and specificity were considered, with a sensitivity of 0.82 (95%CI: 0.72–0.89), a specificity of 0.87 (95%CI: 0.68–0.95), and a diagnostic odds ratio of 29 (95%CI: 10.0–84.0). If only higher sensitivity was required, HDRS and MADRS were more advantageous, whereas when only higher specificity was considered, PHQ-9 and HADS-D were more advantageous (Table 2).

Major depression

When classifying depression as major depression, six scales were used to assess PSD. Overall, when the sensitivity and specificity were considered together, HDRS had a high diagnostic power, with a sensitivity of 0.92 (95%CI: 0.82–0.97), specificity of 0.89 (95%CI: 0.84–0.92), and diagnostic odds ratio of 94 (95%CI: 32–281); Likewise, if only the sensitivity was considered, BDI, HDRS, MADRS had the advantage; but for higher specificity, PHQ-9 and PHQ-2 had the advantage (Table  3 ).

Staging of stroke

Acute phase after stroke

A total of three scales were used to assess PSD in the acute phase of stroke. The PHQ-9 had high diagnostic performance when both sensitivity and specificity were considered, with a sensitivity of 0.85 (95%CI: 0.78–0.91), a specificity of 0.90 (95%CI: 0.82–0.95), and a diagnostic odds ratio of 55 (95%CI: 30–102). If only higher sensitivity was considered, MADRS was more favorable, and if only higher specificity was considered, PHQ-9 was more favorable (Table 4).

Chronic phase after stroke

Eight scales were used to assess PSD in the chronic phase of stroke. Overall, when high sensitivity and specificity were considered together, HDRS had high diagnostic power, with a sensitivity of 0.94 (95%CI: 0.87–0.98), a specificity of 0.85 (95%CI: 0.76–0.91), and a diagnostic odds ratio of 96 (95%CI: 27–346). If only higher sensitivity was considered, HDRS and BDI had the advantage; conversely, if only higher specificity was considered, PHQ-2 and CES-D had the advantage (Table 5).

Specific populations

For the analysis of specific populations with PSD, 9 of the 32 studies compared the baseline characteristics of depressed and non-depressed patients after stroke. Based on previous work and the data included in this study, a total of seven specific populations were analyzed, with clinical features including cognitive impairment, severe aphasia, pre-onset antidepressant medication, first stroke, severity of neurological deficit, educational level, and previous psychiatric history (Table 6). However, due to the different inclusion and exclusion criteria and priorities among the original studies, the included data were insufficient, and an effective statistical analysis could not be performed.

Prevalence of PSD

The results showed that the prevalence of PSD was approximately 17.0% to 29.0%, and the prevalence of PSD in the acute and chronic phases of stroke was 0.23 (95%CI 0.16–0.32) and 0.25 (95%CI 0.19–0.31), respectively. The prevalence of PSD for any depression and major depression was 0.29 (95%CI 0.23–0.34) and 0.17 (95%CI 0.13–0.22), respectively (Table 7  and Fig.  5 ).

Figure 5. Prevalence of post-stroke depression in different stroke periods and depression types (forest plots)

Thirty-two studies were analyzed to determine the best assessment scale for PSD. Each of the scales examined (PHQ-9, HDRS, MADRS, BDI, PHQ-2, CES-D, and HADS-D) had advantages in diagnosing PSD that varied with depression type and stroke stage. The PHQ-9 showed higher diagnostic efficacy than the other scales for any depression and for the acute phase after stroke, whereas the HDRS performed better for major depression and for the chronic phase after stroke. Owing to limitations in the available data, no scale has yet been shown to accurately assess PSD in patients with comorbid aphasia and cognitive impairment.

Currently, many studies use depression assessment scales to diagnose PSD. This remains controversial, however, as some studies suggest that these scales are not suitable for diagnosing PSD but rather for assessing the severity of depressive symptoms, treatment efficacy, or prognosis [ 48 , 49 ]. Whether a scale can substitute for structured interviews in diagnosing PSD depends on its diagnostic accuracy. Our analysis showed that the PHQ-9 and HDRS performed excellently in identifying depressive symptoms and their severity. The PHQ-9 is a self-rating scale consisting of 9 items with high sensitivity and specificity [ 50 , 51 ]. It has been widely used to screen for PSD because it is simple, takes little time, and places low demands on patient cooperation. The HDRS, introduced in 1960, comprises seven categories, including items for somatic symptoms [ 52 ]. In the chronic phase of stroke, many patients experience atypical depressive symptoms, such as gastrointestinal symptoms, weight loss, generalized pain, fatigue, and other physical discomfort [ 53 ]; the HDRS can assess these patients more accurately. Additionally, studies have shown that the HDRS can be used not only to evaluate the severity of PSD but also to assess the efficacy of antidepressant treatment [ 54 , 55 ].
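As a concrete illustration of why the PHQ-9 is quick to administer, the sketch below scores it: 9 items rated 0 to 3 give a total of 0 to 27. The screening cutoff of 10 used here is a common convention, not the optimal threshold for stroke populations, which varied across the included studies.

```python
# Minimal sketch of PHQ-9 total scoring: 9 items, each rated 0-3, total 0-27.
# The cutoff of >=10 is a commonly used screening threshold and is shown here
# for illustration only; stroke-specific cutoffs differed between studies.

def phq9_total(item_scores: list[int]) -> int:
    assert len(item_scores) == 9 and all(0 <= s <= 3 for s in item_scores)
    return sum(item_scores)

def screens_positive(item_scores: list[int], cutoff: int = 10) -> bool:
    return phq9_total(item_scores) >= cutoff

example = [2, 1, 2, 1, 0, 1, 2, 1, 0]   # hypothetical responses
print(phq9_total(example), screens_positive(example))  # 10 True
```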

Burton conducted a review of scales used to screen for post-stroke mood disorders in 2015 [ 56 ]. That review focused on mood disorders after stroke, which include major depression, depression of any severity, and anxiety. Meader conducted a related meta-analysis in 2014 that included 24 studies involving 2907 patients [ 57 ]; the results showed that several scales, such as the CES-D, HDRS, and PHQ-9, could screen for PSD, but that these scales should not be used alone and should be combined with detailed clinical assessment. Compared with Burton's and Meader's studies, our study included thirty-two studies and described the stage of stroke and the type of depression more precisely. In addition, we discussed the selection of scales for PSD in special populations and analyzed the prevalence of PSD.

There is still no unified conclusion about the staging of stroke, and the time since stroke affects the symptoms of PSD [ 58 , 59 ]. Some studies recommend assessing PSD at 2 or 8 weeks after stroke, and Toso's study found that PSD occurred most often within 3 months after stroke [ 60 ]. In our study, stroke was staged into an acute phase (within 2 months of onset) and a chronic phase (more than 2 months after onset). According to the severity of depression, Robinson classified PSD into mild PSD (mild depression) and severe PSD (severe depression); mild PSD corresponds to dysthymia in DSM-III, whereas severe PSD meets the diagnostic criteria for MDD [ 61 ]. Therefore, in this study PSD was divided into two groups, any depression and major depression, and it should be emphasized that any depression includes both major and mild depression.
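The staging rule used in this review can be expressed as a one-line classification, as in the sketch below; treating the 2-month boundary as 60 days is an approximation for illustration.

```python
# Minimal sketch of the staging rule described above: acute phase within
# 2 months of stroke onset, chronic phase thereafter. Using 60 days for the
# 2-month boundary is an approximation of the definition in the text.

def stroke_phase(days_since_onset: int) -> str:
    return "acute" if days_since_onset <= 60 else "chronic"

print(stroke_phase(14), stroke_phase(120))  # acute chronic
```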

This study also aimed to determine which scale is more effective for identifying and assessing depressive symptoms in specific populations with PSD. However, because the inclusion and exclusion criteria and priorities differed among the original studies, the available data were insufficient and no meaningful statistical analysis could be performed. Stroke patients often experience complications such as aphasia and cognitive dysfunction, which can exacerbate PSD. A related study found that patients with post-stroke aphasia are more likely to suffer from depression than patients without aphasia [ 62 ]. According to a systematic review by Mariska, there is insufficient evidence to support the use of any specific scale for evaluating depressive symptoms in patients with aphasia, and the evidence level of existing studies is relatively low [ 63 ]. In addition, relevant studies have shown that post-stroke cognitive impairment (PSCI) is closely related to the occurrence of PSD [ 64 , 65 ], and impaired cognitive function can affect the evaluation of depressive symptoms to varying degrees. At present, cognitive scales developed for Alzheimer's disease, such as the Mini-Mental State Examination (MMSE), the Montreal Cognitive Assessment Scale (MoCA), and the Cambridge Geriatric Cognitive Scale (CAMCOG), are often used in clinical practice to assess PSCI. However, organic damage to the cerebral parenchyma in stroke patients, together with complications such as aphasia, visual impairment, dyslexia, and limb dysfunction, limits the evaluation of PSCI with these scales [ 66 , 67 ]. Hence, further research is warranted to determine the most suitable scales for assessing depressive symptoms in patients with post-stroke aphasia and cognitive impairment.

The results of the study revealed that the prevalence of PSD, determined through standard structured interviews, ranged from 17.0% to 29.0%. Previous studies by Ayerbe and Hackett indicated that approximately one third of stroke patients experience some degree of depression within five years of the stroke event [ 68 , 69 , 70 ]; it is important to note that prevalence in those studies was assessed primarily with depression scales. Many factors affect the prevalence of PSD, such as the population studied and the time and place of assessment. Opinions still diverge on whether the timing of PSD assessment influences the observed prevalence of depression. Some studies have shown that the prevalence of depression in the acute phase after stroke is higher than in the chronic phase and that prevalence gradually decreases over time [ 71 , 72 , 73 ], whereas another study found no difference in the prevalence of PSD among the early, middle, and late stages of stroke [ 74 ]. Therefore, more high-quality prospective studies are needed to clarify this issue.

Limitations

There are some limitations to this study. (1) This study was a secondary analysis, and the included studies exhibited significant heterogeneity due to variations in the diagnostic thresholds used for each scale; in addition, the optimal diagnostic cutoff of each scale was not analyzed and needs to be clarified in future studies. (2) Data limitations and mismatches between the original studies hindered subgroup analyses of scale selection, preventing adequate analyses for different types and severities of stroke, patients with aphasia, elderly patients, individuals with a history of depression, and other populations. In the future, developing more comprehensive research protocols for PSD is crucial.

In conclusion, a variety of scales are available to evaluate PSD. To improve diagnostic effectiveness, several scales can be combined for dynamic, multidirectional evaluation and follow-up. The PHQ-9 and HDRS are recommended for the evaluation of PSD because of their high diagnostic efficiency. Structured interviews based on diagnostic criteria can determine whether stroke patients have depressive symptoms, and depression scales can further determine the severity of those symptoms; rating scales with high diagnostic efficacy, such as the HDRS, may be recommended as substitutes for structured interviews based on diagnostic criteria. Currently, depression scales suitable for evaluating patients with post-stroke aphasia and cognitive dysfunction are still lacking.

Availability of data and materials

No datasets were generated or analysed during the current study.

Abbreviations

Aphasic Depression Rating Scale

Beck Depression Inventory

Cambridge Geriatric Cognitive Scale

Chinese Classification and Diagnostic Standard of Mental Disorders

Center for Epidemiological Studies-Depression

Clinical Global Impression-Scale

Post-Stroke Depression Prediction Scale

Diagnostic and Statistical Manual of Mental Disorders

Geriatric Depression Screening Scale

Hospital Anxiety and Depression Scale

Hospital Anxiety and Depression Scale—Depression

Hamilton Depression Scale

International Classification of Mental Disorders

Montgomery Asberg Depression Rating Scale

Major Depressive Disorder

Medical Subject Headings

Mini-Mental State Examination

Montreal Cognitive Assessment Scale

The Patient Health Questionnaire-2

Patient Health Questionnaire-9

Post-Stroke Cognitive Impairment

Post-Stroke Depression

Post Stroke Depression Rating Scale

Post-Stroke Depression Symptom Inventory

Stroke Aphasic Depression Questionnaire

Self-rating Depression Scale

Signs of Depression Scale

Visual Analog Mood Scale

Visual Analogue self-esteem Scale

GBD 2021 diseases and injuries collaborators. Global incidence, prevalence, years lived with disability (YLDs), disability-adjusted life-years (DALYs), and healthy life expectancy (HALE) for 371 diseases and injuries in 204 countries and territories and 811 subnational locations, 1990-2021: a systematic analysis for the Global burden of disease study 2021. Lancet. 2024;403(10440):2133–61.

Feigin PVL. Global, regional, and national burden of neurological disorders, 1990–2016: a systematic analysis for the Global Burden of Disease Study 2016. Lancet Neurol. 2019;18(5):459–80.


Lanctôt KL, Lindsay MP, Smith EE, Sahlas DJ, Foley N, Gubitz G, et al. Canadian stroke best practice recommendations: mood, cognition and fatigue following stroke, 6th edition update 2019. Int J Stroke. 2020;15(6):668–88.


Towfighi A, Ovbiagele B, El Husseini N, Hackett ML, Jorge RE, Kissela BM, et al. Poststroke depression: a scientific statement for healthcare professionals from the American Heart Association/American Stroke Association. Stroke. 2017;48(2):e30–43.

Pohjasvaara T, Leskelä M, Vataja R, Kalska H, Ylikoski R, Hietanen M, et al. Post-stroke depression, executive dysfunction and functional outcome. Eur J Neurol. 2002;9(3):269–75.


Villa RF, Ferrari F, Moretti A. Post-stroke depression: mechanisms and pharmacological treatment. Pharmacol Ther. 2018;184:131–44.

Battle DE. Diagnostic and Statistical Manual of mental disorders (DSM). Codas. 2013;25(2):191–2.


Guo J, Wang J, Sun W, Liu X. The advances of post-stroke depression: 2021 update. J Neurol. 2022;269(3):1236–49.

World Health Organization. ICD-10 : international statistical classification of diseases and related health problems : tenth revision, 2nd ed. World Health Organization; 2004. https://iris.who.int/handle/10665/42980 .

Chen YF. Chinese classification and diagnostic criteria of mental disorders, third version. Psych branch Chin Med Assoc. 2001;1:18–19.

Wang SS, Zhou XY, Zhu CY. Chinese expert consensus on clinical practice of post-stroke depression. Chin J Stroke. 2016;11(8):685–93.


Yue Y, Liu R, Lu J, Wang X, Zhang S, Wu A, et al. Reliability and validity of a new post-stroke depression scale in Chinese population. J Affect Disord. 2015;174:317–23.

Hirt J, van Meijeren LCJ, Saal S, Hafsteinsdóttir TB, Hofmeijer J, Kraft A, et al. Predictive accuracy of the post-stroke depression prediction scale: a prospective binational observational study. J Affect Disord. 2020;265:39–44.

McGuinness LA, Higgins JPT. Risk-of-bias VISualization (robvis): an R package and Shiny web app for visualizing risk-of-bias assessments. Res Synth Methods. 2021;12(1):55–61.

Zhang TS. Applied methodology for evidence-based medicine. Changsha: Central South University Press; 2014. p. 417–8.

Spitzer RL, Williams JB, Gibbon M, First MB. The Structured Clinical Interview for DSM-III-R (SCID). I: history, rationale, and description. Arch Gener Psychiatry. 1992;49(8):624–9.


Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.


Mikami K, Sudo T, Orihashi Y, Kimoto K, Mizuma A, Uesugi T, et al. Effective tools to predict depression in acute and subacute phase of ischemic stroke. J Neuropsychiatry Clin Neurosci. 2021;33(1):43–8.

Dajpratham P, Pukrittayakamee P, Atsariyasing W, Wannarit K, Boonhong J, Pongpirul K. The validity and reliability of the PHQ-9 in screening for post-stroke depression. BMC Psychiatry. 2020;20(1):291.

Wang EY, Meyer C, Graham GD, Whooley MA. Evaluating screening tests for depression in post-stroke older adults. J Geriatr Psychiatry Neurol. 2018;31(3):129–35.

Prisnie JC, Fiest KM, Coutts SB, Patten SB, Atta CA, Blaikie L, et al. Validating screening tools for depression in stroke and transient ischemic attack patients. Int J Psychiatry Med. 2016;51(3):262–77.

Lewin-Richter A, Volz M, Jöbges M, Werheid K. Predictivity of early depressive symptoms for post-stroke depression. J Nutr Health Aging. 2015;19(7):754–8.

Imarhiagbe FA, Owolabi A. Post-stroke depression in a sub-Saharan Africans: validation of the japanese stroke scale for depression. Sahel Med J. 2015;18(3):121.

Lees R, Stott DJ, Quinn TJ, Broomfield NM. Feasibility and diagnostic accuracy of early mood screening to diagnose persisting clinical depression/anxiety disorder after stroke. Cerebrovasc Dis. 2014;37(5):323–9.

Kang HJ, Stewart R, Kim JM, Jang JE, Kim SY, Bae KY, et al. Comparative validity of depression assessment scales for screening poststroke depression. J Affect Disord. 2013;147(1–3):186–91.

Turner A, Hambridge J, White J, Carter G, Clover K, Nelson L, et al. Depression screening in stroke: a comparison of alternative measures with the structured diagnostic interview for the diagnostic and statistical manual of mental disorders, fourth edition (major depressive episode) as criterion standard. Stroke. 2012;43(4):1000–5.

de Man-van Ginkel JM, Hafsteinsdóttir T, Lindeman E, Burger H, Grobbee D, Schuurmans M. An efficient way to detect poststroke depression by subsequent administration of a 9-item and a 2-item Patient Health Questionnaire. Stroke. 2012;43(3):854–6.

Sagen U, Vik TG, Moum T, Mørland T, Finset A, Dammen T. Screening for anxiety and depression after stroke: comparison of the hospital anxiety and depression scale and the Montgomery and Asberg depression rating scale. J Psychosom Res. 2009;67(4):325–32.

Roger PR, Johnson-Greene D. Comparison of assessment measures for post-stroke depression. Clin Neuropsychol. 2009;23(5):780–93.

Berg A, Lönnqvist J, Palomäki H, Kaste M. Assessment of depression after stroke a comparison of different screening instruments. Stroke. 2009;40(2):523–9.

Quaranta D, Marra C, Gainotti G. Mood disorders after stroke: diagnostic validation of the poststroke depression rating scale. Cerebrovasc Dis. 2008;26(3):237–43.

Lee AC, Tang SW, Yu GK, Cheung RT. The smiley as a simple screening tool for depression after stroke: a preliminary study. Int J Nurs Stud. 2008;45(7):1081–9.

Healey AK, Kneebone II, Carroll M, Anderson SJ. A preliminary investigation of the reliability and validity of the Brief Assessment Schedule Depression Cards and the Beck Depression Inventory-Fast Screen to screen for depression in older stroke survivors. Int J Geriatr Psychiatry. 2008;23(5):531–6.

Lightbody CE, Baldwin R, Connolly M, Gibbon B, Jawaid N, Leathley M, et al. Can nurses help identify patients with depression following stroke? A pilot study using two methods of detection. J Adv Nurs. 2007;57(5):505–12.

Laska AC, Mårtensson B, Kahan T, von Arbin M, Murray V. Recognition of depression in aphasic stroke patients. Cerebrovasc Dis. 2007;24(1):74–9.

Williams LS, Brizendine EJ, Plue L, Bakas T, Tu W, Hendrie H, et al. Performance of the PHQ-9 as a screening tool for depression after stroke. Stroke. 2005;36(3):635–8.

Tang WK, Ungvari GS, Chiu HFK, Sze KH. Detecting depression in Chinese stroke patients: a pilot study comparing four screening instruments. Int J Psychiatry Med. 2004;34(2):155–63.

Tang WK, Ungvari GS, Chiu HF, Sze KH, Yu AC, Leung TL. Screening post-stroke depression in Chinese older adults using the hospital anxiety and depression scale. Aging Ment Health. 2004;8(5):397–9.

Tang WK, Chan SS, Chiu HF, Wong KS, Kwok TC, Mok V, et al. Can the Geriatric Depression Scale detect poststroke depression in Chinese elderly? J Affect Disord. 2004;81(2):153–6.

Lincoln NB, Nicholl CR, Flannaghan T, Leonard M, Van der Gucht E. The validity of questionnaire measures for assessing depression after stroke. Clin Rehabil. 2003;17(8):840–6.

Naarding P, Leentjens AF, van Kooten F, Verhey FR. Disease-specific properties of the rating scale for depression in patients with stroke, Alzheimer’s dementia, and Parkinson’s disease. J Neuropsychiatry Clin Neurosci. 2002;14(3):329–34.

Aben I, Verhey F, Lousberg R, Lodder J, Honig A. Validity of the beck depression inventory, hospital anxiety and depression scale, SCL-90, and hamilton depression rating scale as screening instruments for depression in stroke patients. Psychosomatics. 2002;43(5):386–93.

O’Rourke S, MacHale S, Signorini D, Dennis M. Detecting psychiatric morbidity after stroke: comparison of the GHQ and the HAD Scale. Stroke. 1998;29(5):980–5.

Agrell B, Dehlin O. Comparison of six depression rating scales in geriatric stroke patients. Stroke. 1989;20(9):1190–4.

Parikh RM, Eden DT, Price TR, Robinson RG. The sensitivity and specificity of the Center for Epidemiologic Studies Depression Scale in screening for post-stroke depression. Int J Psychiatry Med. 1988;18(2):169–81.

Shinar D, Gross CR, Price TR, Banko M, Bolduc PL, Robinson RG. Screening for depression in stroke patients: the reliability and validity of the Center for Epidemiologic Studies Depression Scale. Stroke. 1986;17(2):241–5.

Yue Y, Liu R, Chen J, Cao Y, Wu Y, Zhang S, et al. The reliability and validity of Post Stroke Depression Scale in different type of Post Stroke Depression patients. J Stroke Cerebrovasc Dis. 2022;31(2):106222.

Yue YY, Yuan YG. Evaluation and diagnosis of post-stroke depression. Pract Geriatr. 2015;29(2):5.

Yuan YG, Jiang HT. Preface-pay attention to the standardized diagnosis and treatment of post-stroke depression. Pract Geriatr. 2015;29(2):91–2.

Trotter TL, Denny DL, Evanson TA. Reliability and validity of the patient health questionnaire-9 as a screening tool for poststroke depression. J Neurosci Nurs. 2019;51(3):147–52.

Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16(9):606–13.


Leng HX, Ding WJ, Zhang WR, Wang HX. Progress in assessment of post-stoke depression. Chin J Stroke. 2020;15(7):795–800.

van de Weg FB, Kuik DJ, Lankhorst GJ. Post-stroke depression and functional outcome: a cohort study investigating the influence of depression on functional recovery from stroke. Clin Rehabil. 1999;13(3):268–72.

Hung CYF, Wu XY, Chung VCH, Tang ECH, Wu JCY, Lau AYL. Overview of systematic reviews with meta-analyses on acupuncture in post-stroke cognitive impairment and depression management. Integr Med Res. 2019;8(3):145–59.

Chen YK, Qu JF, Xiao WM, Li WY, Li W, Fang XW, et al. Intracranial atherosclerosis and poststroke depression in Chinese patients with ischemic stroke. J Stroke Cerebrovasc Dis. 2016;25(4):998–1004.

Burton LJ, Tyson S. Screening for mood disorders after stroke: a systematic review of psychometric properties and clinical utility. Psychol Med. 2015;45(1):29–49.

Meader N, Moe-Byrne T, Llewellyn A, Mitchell AJ. Screening for poststroke major depression: a meta-analysis of diagnostic validity studies. J Neurol Neurosurg Psychiatry. 2014;85(2):198–206.

Bernhardt J, Hayward KS, Kwakkel G, Ward NS, Wolf SL, Borschmann K, et al. Agreed definitions and a shared vision for new standards in stroke recovery research: the stroke recovery and rehabilitation roundtable taskforce. Int J Stroke. 2017;12(5):444–50.

Guerra ZF, Lucchetti G. Divergence among researchers regarding the stratification of time after stroke is still a concern. Int J Stroke. 2018;13(4):NP9.

Toso V, Gandolfo C, Paolucci S, Provinciali L, Torta R, Grassivaro N. Post-stroke depression: research methodology of a large multicentre observational study (DESTRO). Neurol Sci. 2004;25(3):138–44.

Robinson RG, Starr LB, Kubos KL, Price TR. A two-year longitudinal study of post-stroke mood disorders: findings during the initial evaluation. Stroke. 1983;14(5):736–41.

Shehata GA, El Mistikawi T, Risha ASK, Hassan HS. The effect of aphasia upon personality traits, depression and anxiety among stroke patients. J Affect Disord. 2015;172:312–4.

van Dijk MJ, de Man-van Ginkel JM, Hafsteinsdóttir TB, Schuurmans MJ. Identifying depression post-stroke in patients with aphasia: a systematic review of the reliability, validity and feasibility of available instruments. Clin Rehabil. 2016;30(8):795–810.

Kauhanen M, Korpelainen JT, Hiltunen P, Brusin E, Mononen H, Määttä R, et al. Poststroke depression correlates with cognitive impairment and neurological deficits. Stroke. 1999;30(9):1875–80.

Williams OA, Demeyere N. Association of depression and anxiety with cognitive impairment 6 months after stroke. Neurology. 2021;96(15):e1966–74.

Nys GMS, van Zandvoort MJE, de Kort PLM, Jansen BPW, Kappelle LJ, de Haan EHF. Restrictions of the Mini-Mental State Examination in acute stroke. Arch Clin Neuropsychol. 2005;20(5):623–9.

Demeyere N, Riddoch MJ, Slavkova ED, Jones K, Reckless I, Mathieson P, et al. Domain-specific versus generalized cognitive screening in acute stroke. J Neurol. 2016;263(2):306–15.

Ayerbe L, Ayis S, Wolfe CD, Rudd AG. Natural history, predictors and outcomes of depression after stroke: systematic review and meta-analysis. Br J Psychiatry. 2013;202(1):14–21.

Kowalska K, Pasinska P, Klimiec-Moskal E, Pera J, Slowik A, Klimkowicz-Mrowiec A, et al. C-reactive protein and post-stroke depressive symptoms. Sci Rep. 2020;10(1):1431.

Hackett ML, Pickles K. Part I: frequency of depression after stroke: an updated systematic review and meta-analysis of observational studies. Int J Stroke. 2014;9(8):1017–25.

Robinson RG. Poststroke depression: prevalence, diagnosis, treatment, and disease progression. Biol Psychiatry. 2003;54(3):376–87.

Yuan HW, Wang CX, Zhang N, Bai Y, Shi YZ, Zhou Y, et al. Poststroke depression and risk of recurrent stroke at 1 year in a Chinese cohort study. PLoS One. 2012;7(10):e46906.

Zhang N, Wang CX, Wang AX, Bai Y, Zhou Y, Wang YL, et al. Time course of depression and one-year prognosis of patients with stroke in mainland China. CNS Neurosci Ther. 2012;18(6):475–81.

Hackett ML, Yapa C, Parag V, Anderson CS. Frequency of depression after stroke: a systematic review of observational studies. Stroke. 2005;36(6):1330–40.


Acknowledgements

Thanks to Ms. Zhang Fan from Chongqing Medical University for consulting help on the application of statistical methods in the study.

This article was funded by the Chongqing Health Commission (Grant No. 2020MSXM038) and Special support project for clinical research of young and middle-aged doctors in the south of the Five Ridges neurology (Grant No. Z20210305).

Author information

Authors and affiliations.

Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, No.1 Youyi Road, Yuzhong District, Chongqing, 400016, China

Fang Liu, Huan Zhao, Ying-li Li, Zhiwen Yan & Jun Mu

Department of Neurology, Qingdao Eighth People’s Hospital, Qingdao, Shandong, 266000, China


Contributions

Jun Mu and Fang Liu designed the study; Fang Liu and Lei Gong collected the data and materials; Huan Zhao and Ying-li Li checked the data; Fang Liu analyzed the data and wrote the first draft of the manuscript; and Zhiwen Yan advised on the data analysis. All authors commented on previous versions of the manuscript and read and approved the final manuscript.

Corresponding author

Correspondence to Jun Mu .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Liu, F., Gong, L., Zhao, H. et al. Validity of evaluation scales for post-stroke depression: a systematic review and meta-analysis. BMC Neurol 24 , 286 (2024). https://doi.org/10.1186/s12883-024-03744-7


Received : 14 December 2023

Accepted : 26 June 2024

Published : 15 August 2024

DOI : https://doi.org/10.1186/s12883-024-03744-7


  • Depression Scale
  • Meta-analysis



JMIR Med Educ, 8(2), Apr–Jun 2022

Usability Methods and Attributes Reported in Usability Studies of Mobile Apps for Health Care Education: Scoping Review

Susanne Grødem Johnson

1 Faculty of Health and Function, Western Norway University of Applied Sciences, Bergen, Norway

Thomas Potrebny

Lillebeth Larun

2 Division of Health Services, Norwegian Institute of Public Health, Oslo, Norway

Donna Ciliska

3 Faculty of Health Sciences, McMaster University, Hamilton, ON, Canada

Nina Rydland Olsen

Associated data.

PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) checklist for reporting scoping reviews.

The search strategies for the 10 databases.

Data extraction sheet.

Mobile devices can provide extendable learning environments in higher education and motivate students to engage in adaptive and collaborative learning. Developers must design mobile apps that are practical, effective, and easy to use, and usability testing is essential for understanding how mobile apps meet users’ needs. No previous reviews have investigated the usability of mobile apps developed for health care education.

The aim of this scoping review is to identify usability methods and attributes in usability studies of mobile apps for health care education.

A comprehensive search was carried out in 10 databases, reference lists, and gray literature. Studies were included if they dealt with health care students and usability of mobile apps for learning. Frequencies and percentages were used to present the nominal data, together with tables and graphical illustrations. Examples include a figure of the study selection process, an illustration of the frequency of inquiry usability evaluation and data collection methods, and an overview of the distribution of the identified usability attributes. We followed the Arksey and O’Malley framework for scoping reviews.

Our scoping review collated 88 articles involving 98 studies, mainly related to medical and nursing students. The studies were conducted from 22 countries and were published between 2008 and 2021. Field testing was the main usability experiment used, and the usability evaluation methods were either inquiry-based or based on user testing. Inquiry methods were predominantly used: 1-group design (46/98, 47%), control group design (12/98, 12%), randomized controlled trials (12/98, 12%), mixed methods (12/98, 12%), and qualitative methods (11/98, 11%). User testing methods applied were all think aloud (5/98, 5%). A total of 17 usability attributes were identified; of these, satisfaction, usefulness, ease of use, learning performance, and learnability were reported most frequently. The most frequently used data collection method was questionnaires (83/98, 85%), but only 19% (19/98) of studies used a psychometrically tested usability questionnaire. Other data collection methods included focus group interviews, knowledge and task performance testing, and user data collected from apps, interviews, written qualitative reflections, and observations. Most of the included studies used more than one data collection method.

Conclusions

Experimental designs were the most commonly used methods for evaluating usability, and most studies used field testing. Questionnaires were frequently used for data collection, although few studies used psychometrically tested questionnaires. The usability attributes identified most often were satisfaction, usefulness, and ease of use. The results indicate that combining different usability evaluation methods, incorporating both subjective and objective usability measures, and specifying which usability attributes to test seem advantageous. The results can support the planning and conduct of future usability studies for the advancement of mobile learning apps in health care education.

International Registered Report Identifier (IRRID)

RR2-10.2196/19072

Introduction

Mobile devices can provide extendable learning environments and motivate students to engage in adaptive and collaborative learning [ 1 , 2 ]. Mobile devices offer various functions, enable convenient access, and support the ability to share information with other learners and teachers [ 3 ]. Most students own a mobile phone, which makes mobile learning easily accessible [ 4 ]. However, there are some challenges associated with mobile devices in learning situations, such as small screen sizes, connectivity problems, and multiple distractions in the environment [ 5 ].

Developers of mobile learning apps need to consider usability to ensure that apps are practical, effective, and easy to use [ 1 ] and to ascertain that mobile apps meet users’ needs [ 6 ]. According to the International Organization for Standardization, usability is defined as “the extent to which a system, product or service can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use” [ 7 ]. Better mobile learning usability will be achieved by focusing on user-centered design and attention to context, ensuring that the technology corresponds to the user’s requirements and putting the user at the center of the process [ 8 , 9 ]. In addition, it is necessary to be conscious of the interrelatedness between usability and pedagogical design [ 9 ].

A variety of usability evaluation methods exists to test the usability of mobile apps, and Weichbroth [ 10 ] categorized them into the following 4 categories: inquiry, user testing, inspection, and analytical modeling. Inquiry methods are designed to gather data from users through questionnaires (quantitative data) and interviews and focus groups (qualitative data). User testing methods include think-aloud protocols, question-asking protocols, performance measurements, log analysis, eye tracking, and remote testing. Inspection methods, in contrast, involve experts testing apps, heuristic evaluation, cognitive walk-through, perspective-based inspections, and guideline reviews. Analytical modeling methods include cognitive task analysis and task environment analysis [ 10 ]. Across these 4 usability evaluation methods, the most commonly used data collection methods are controlled observations and surveys, whereas eye tracking, think-aloud methods, and interviews are applied less often [ 10 ].
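The four-way categorization above can be summarized as a simple lookup table, as in the sketch below; the groupings restate the text and are not an exhaustive taxonomy.

```python
# Minimal sketch: the four categories of usability evaluation methods described
# above (after Weichbroth), expressed as a lookup table for quick classification.
USABILITY_EVALUATION_METHODS = {
    "inquiry": ["questionnaire", "interview", "focus group"],
    "user testing": ["think-aloud protocol", "question-asking protocol",
                     "performance measurement", "log analysis", "eye tracking",
                     "remote testing"],
    "inspection": ["expert testing", "heuristic evaluation",
                   "cognitive walk-through", "perspective-based inspection",
                   "guideline review"],
    "analytical modeling": ["cognitive task analysis", "task environment analysis"],
}

def category_of(method: str) -> str:
    for category, methods in USABILITY_EVALUATION_METHODS.items():
        if method in methods:
            return category
    return "uncategorized"

print(category_of("eye tracking"))  # user testing
```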

Usability evaluations are normally performed in a laboratory or in field testing. Previous reviews have reported that usability evaluation methods are mainly conducted in a laboratory, which means in a controlled environment [ 1 , 11 ]. By contrast, field testing is conducted in real-life settings. There are pros and cons to the 2 different approaches. Field testing allows data collection within a dynamic environment, whereas in a laboratory data collection and conditions are easier to control [ 1 ]. A variety of data collection methods are appropriate for usability studies; for instance, in laboratories, participants performing predefined tasks, such as using questionnaires and observations, are often applied [ 1 ]. In field testing, logging mechanisms and diaries have been applied to capture user interaction with mobile apps [ 1 ].

In all, 2 systematic reviews examined various psychometrically tested usability questionnaires as a means of enhancing the usability of apps. Sousa and Lopez [ 12 ] identified 15 such questionnaires and Sure [ 13 ] identified 13. In all, 5 of the questionnaires have proven to be applicable in usability studies in general: the System Usability Scale (SUS), Questionnaire for User Interaction Satisfaction, After-Scenario Questionnaire, Post-Study System Usability Questionnaire, and Computer System Usability Questionnaire [ 12 ]. The SUS questionnaire and After-Scenario Questionnaire are most widely applied [ 13 ]. The most frequently reported usability attributes of these 5 questionnaires are learnability, efficiency, and satisfaction [ 12 ].
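Because the SUS is the most widely applied of these questionnaires, a brief scoring sketch may be useful: 10 items are rated 1 to 5, odd items contribute (response - 1), even items contribute (5 - response), and the sum is multiplied by 2.5 to yield a 0 to 100 score. This is standard SUS scoring, not a procedure specific to the studies reviewed here.

```python
# Minimal sketch of System Usability Scale (SUS) scoring: 10 items rated 1-5,
# odd items contribute (response - 1), even items contribute (5 - response),
# and the sum is scaled by 2.5 to give a 0-100 score.

def sus_score(responses: list[int]) -> float:
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    contributions = [(r - 1) if i % 2 == 0 else (5 - r)   # i=0 is item 1 (odd)
                     for i, r in enumerate(responses)]
    return sum(contributions) * 2.5

print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 4, 2]))  # 80.0
```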

Usability attributes are features that measure the quality of mobile apps [ 1 ]. The most commonly reported usability attributes are effectiveness, efficiency, and satisfaction [ 5 ], which are part of the usability definition [ 7 ]. In the review by Weichbroth [ 10 ], 75 different usability attributes were identified. Given the wide selection of usability attributes, choosing appropriate attributes depends on the nature of the technology and the research question in the usability study [ 14 ]. Kumar and Mohite [ 1 ] recommended that researchers present and explain which usability attributes are being tested when mobile apps are being developed.

Previous reviews have examined the usability of mobile apps in general [ 5 , 10 , 11 , 14 , 15 ], but only one systematic review has specifically explored the usability of mobile learning apps [ 1 ], and it did not include studies from health care education. Similarly, usability has not been widely explored in medical education apps [ 16 ]. Thus, there is a need to develop a better understanding of how the usability of mobile learning apps developed for health care education has been evaluated and conceptualized in previous studies.

The aim of this scoping review has therefore been to identify usability methods and attributes in usability studies of mobile apps for health care education.

We have used the framework for scoping reviews developed by Arksey and O'Malley [ 17 ] and further developed by Levac et al [ 18 ] and Khalil et al [ 19 ]. We adopted the following five stages of this framework: (1) identifying the research question, (2) identifying relevant studies, (3) selecting studies, (4) charting the data, and (5) summarizing and reporting the results [ 17 - 19 ]. A detailed presentation of each step can be found in the published protocol for this scoping review [ 20 ]. We followed the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) checklist for reporting scoping reviews ( Multimedia Appendix 1 [ 21 ]).

Stage 1: Identifying the Research Question

The following two research questions have been formulated:

  • Which usability methods are used to evaluate the usability of mobile apps for health care education?
  • Which usability attributes are reported in the usability studies of mobile apps for health care education?

Stage 2: Identifying Relevant Studies

A total of 10 electronic databases covering technology, education, and health care were searched for records from January 2008, with searches run in October 2021 and February 2022. The databases were Engineering Village, Scopus, ACM Digital Library, IEEE Xplore, Education Resource Information Center, PsycINFO, CINAHL, MEDLINE, EMBASE, and Web of Science. The search string was developed by the first author and a research librarian and then peer reviewed by another research librarian. The search terms used in Web of Science, in addition to all relevant subject headings, included ((student* or graduate* or undergraduate* or postgraduate*) NEAR/3 nurs*). This search string was repeated for other types of students and combined with the Boolean operator OR. The search string for all types of health care students was then combined with various search terms for mobile apps and mobile learning using the Boolean operator AND. Similar search strategies were adapted for all 10 databases, as shown in Multimedia Appendix 2 . In addition, a citation search in Google Scholar, screening of the reference lists of included studies, and a search for gray literature in OpenGrey were conducted.
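To illustrate the block-building logic described above, the sketch below assembles a simplified search string. Only the nursing block mirrors the Web of Science example in the text; the additional student groups and the app/learning terms are illustrative placeholders, not the published strategy (see Multimedia Appendix 2 for the full strings).

```python
# Minimal sketch of assembling a block-built Boolean search string. The
# proximity syntax follows the Web of Science example in the text; all terms
# other than the nursing block are illustrative placeholders.
student_types = ["nurs*", "medic*", "pharmac*", "physiotherap*"]  # last three are examples
student_block = " OR ".join(
    f"((student* or graduate* or undergraduate* or postgraduate*) NEAR/3 {t})"
    for t in student_types
)
app_block = '("mobile app*" OR "mobile learning" OR smartphone*)'  # illustrative
query = f"({student_block}) AND {app_block}"
print(query)
```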

Stage 3: Selecting Studies

Two of the authors independently screened titles and abstracts using Rayyan web-based management software [ 22 ]. Studies deemed eligible by one of the authors were included for full-text screening and imported into the EndNote X9 (Clarivate) reference management system [ 23 ]. Eligibility for full-text screening was determined independently by two of the authors and disagreements were resolved by consensus-based discussions. Research articles with different designs were included, and there were no language restrictions. As mobile apps started appearing in 2008, this year was set as the starting point for the search. Eligibility criteria are presented in Table 1 .

Table 1. Study eligibility.

Population. Inclusion: health care and allied health care students at the undergraduate and postgraduate levels. Exclusion: health care professionals or students from education, engineering, or other nonhealth sciences.

Concept. Inclusion: studies of usability testing or methods of usability evaluation of mobile learning apps where the purpose relates to the development of the apps. Exclusion: studies relating to learner management systems, e-learning platforms, open online courses, or distance education.

Context. Inclusion: typical educational settings (eg, classroom teaching, clinical placement, or simulation training), including both synchronous and asynchronous teaching. Exclusion: noneducational settings not involving clinical placement or learning situations (eg, hospital or community settings).

Stage 4: Charting the Data (Data Abstraction)

The extracted data included information about the study (eg, authors, year of publication, title, and country), population (eg, number of participants), concepts (usability methods, usability attributes, and usability phase), and context (educational setting). The final data extraction sheet can be found in Multimedia Appendix 3 [ 24 - 111 ]. One review author extracted the data from the included studies into Microsoft Excel [ 21 ], and the extraction was checked by another researcher.

Descriptions of usability attributes have not been standardized, making categorization challenging. Therefore, a review author used deductive analysis to interpret the usability attributes reported in the included studies. This interpretation was based on a review of usability attributes as defined in previous literature. These definitions were assessed on the basis of the results of the included studies. This analysis was reviewed and discussed by another author. Disagreements were resolved through a consensus-based discussion.
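A minimal sketch of this kind of deductive coding is shown below: free-text usability descriptions are mapped onto predefined attribute labels through a keyword dictionary. Both the keyword lists and the example sentence are hypothetical; they are not the authors' actual coding scheme.

```python
# Minimal sketch of deductive coding of usability attributes: free-text
# descriptions are matched against predefined attribute labels via keywords.
# The keywords and example text are hypothetical, not the review's scheme.
ATTRIBUTE_KEYWORDS = {
    "ease of use": ["easy to use", "ease of use"],
    "satisfaction": ["satisfied", "satisfaction", "enjoyed"],
    "usefulness": ["useful", "usefulness"],
    "learnability": ["easy to learn", "learnability"],
}

def code_attributes(text: str) -> set[str]:
    lowered = text.lower()
    return {attr for attr, keys in ATTRIBUTE_KEYWORDS.items()
            if any(k in lowered for k in keys)}

print(code_attributes("Students found the app easy to use and useful for revision."))
# prints the set {'ease of use', 'usefulness'} (order may vary)
```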

Stage 5: Summarizing and Reporting the Results

Frequencies and percentages were used to present nominal data, together with tables and graphical illustrations. For instance, a figure showing the study selection process, an illustration of the frequency of inquiry-based usability evaluation and data collection methods, and an overview of the distribution of identified usability attributes were provided.
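For example, the frequency and percentage summaries could be produced from the data extraction sheet along the lines of the sketch below; the records shown are hypothetical rows, not the review's extraction data.

```python
# Minimal sketch of tallying study designs and usability attributes and
# reporting them as "count/total (percent)". The rows are hypothetical.
from collections import Counter

extracted = [
    {"design": "posttest 1-group", "attributes": ["satisfaction", "usefulness"]},
    {"design": "randomized controlled trial", "attributes": ["ease of use", "satisfaction"]},
    {"design": "qualitative", "attributes": ["satisfaction"]},
]

design_counts = Counter(row["design"] for row in extracted)
attribute_counts = Counter(a for row in extracted for a in row["attributes"])

n = len(extracted)
for design, count in design_counts.items():
    print(f"{design}: {count}/{n} ({100 * count / n:.0f}%)")
print(attribute_counts.most_common())
```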

Eligible Studies

Database searches yielded 34,369 records, and 2796 records were identified using other methods. After removing duplicates, 28,702 records remained, of which 626 reports were examined in full text. In all, 88 articles were included in the scoping review [ 24 - 111 ] ( Figure 1 ). A total of 8 articles reported results from more than one study, presented as study A, study B, or study C in Multimedia Appendix 3 ; therefore, a total of 98 studies were reported in the 88 included articles.

Figure 1. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flowchart of the study selection process.

The included studies comprised a total sample of 7790 participants, with between 5 and 736 participants per study. Most of the studies included medical students (34/88, 39%) or nursing students (25/88, 28%). Other participants included students from the following disciplines: pharmacy (9/88, 10%), dentistry (5/88, 6%), physiotherapy (5/88, 6%), health sciences (3/88, 3%), and psychology (2/88, 2%). Further information is provided in Multimedia Appendix 3 . Studies were published from 22 countries, most frequently the United States (22/88, 25%), Spain (9/88, 10%), the United Kingdom (8/88, 9%), Canada (7/88, 8%), and Brazil (7/88, 8%), with an increasing number of publications from 2014 onward. Table 2 provides an overview and characteristics of the included articles.

Characteristics of included articles.

Study numberStudyPopulation (N)Research design: data collection methodUsability attributes
1Aebersold et al [ ], 2018, United StatesNursing (N=69)Mixed methods: questionnaire; task and knowledge performance Ease of use; learning performance; satisfaction; usefulness
2Akl et al [ ], 2008, United StatesResident (N=30)Qualitative methods: focus groups; written qualitative reflectionsSatisfaction
3Al-Rawi et al [ ], 2015, United StatesDentist (N=61)Posttest 1-group design: questionnaireEase of use; frequency of use; satisfaction; usefulness
4Albrecht et al [ ], 2013, GermanyMedicine (N=6)Posttest 1-group design: questionnaire Satisfaction
5Alencar Neto et al [ ], 2020, BrazilMedicine (N=132)Posttest 1-group design: questionnaire Ease of use; learnability; satisfaction; usefulness
6Alepis and Virvou [ ], 2010, GreeceMedicine (N=110)Mixed methods: questionnaire; interviewsEase of use; usefulness; user-friendliness
7Ameri et al [ ], 2020, IranPharmacy (N=241)Posttest 1-group design: questionnaire Context of use; efficiency; usefulness
8Balajelini and Ghezeljeh [ ], 2018, IranNursing (N=41)Posttest 1-group design: questionnaireEase of use; frequency of use; navigation; satisfaction; simplicity; usefulness
9Barnes et al [ ], 2015, United KingdomMedicine (N=42)Randomized controlled trial: questionnaire; task and knowledge performanceEase of use; effectiveness; learning performance; satisfaction
10Busanello et al [ ], 2015, BrazilDentist (N=62)Pre-post test, nonrandomized control group design: questionnaire Learnability; learning performance; satisfaction
11Cabero-Almenara and Roig-Vila [ ], 2019, SpainMedicine (N=50)Pre-post test, 1-group design: questionnaire Learning performance; satisfaction
12Choi et al [ ], 2015, South KoreaNursing (N=5)Think-aloud methods: interviews; data from appContext of use; ease of use; learnability; satisfaction; usefulness
13Choi et al [ ], 2018, South KoreaNursing (N=75)Pre-post test, nonrandomized control group design: questionnaireEase of use; learning performance; satisfaction; usefulness
14Choo et al [ ], 2019, SingaporePsychology (N=8)Mixed methods: questionnaire ; written qualitative reflectionsEase of use; learning performance; satisfaction; usefulness; user-friendliness
15Chreiman et al [ ], 2017, United StatesMedicine (N=30)Posttest 1-group design: questionnaire; data from appContext of use; ease of use; frequency of use; usefulness
16Colucci et al [ ], 2015, United StatesMedicine (N=115)Posttest 1-group design: questionnaireEffectiveness; efficiency; satisfaction; usefulness
17Davids et al [ ], 2014, South AfricaResidents (N=82)Randomized controlled trial: questionnaire ; data from appEffectiveness; efficiency; learnability; navigation; satisfaction; user-friendliness
18ADemmans et al [ ], 2018, CanadaNursing (N=60)Pre-post test, nonrandomized control group design: questionnaire; observationsEase of use; effectiveness; learnability; learning performance; navigation; satisfaction
18BDemmans et al [ ], 2018, CanadaNursing (N=85)Pre-post test, nonrandomized control group design: questionnaire; observationsEase of use; effectiveness; learnability; learning performance; navigation; satisfaction
19Devraj et al [ ], 2021, United StatesPharmacy (N=89)Posttest 1-group design: questionnaire; data from appEase of use; errors; frequency of use; learning performance; navigation; operational usability; satisfaction; usefulness
20Díaz-Fernández et al [ ], 2016, SpainPhysiotherapy (N=110)Posttest 1-group design: questionnaireComprehensibility; ease of use; usefulness
21Docking et al [ ], 2018, United KingdomParamedic (N=24)Think-aloud methods: focus groupsContext of use; learnability; satisfaction; usefulness
22Dodson and Baker [ ], 2020, United StatesNursing (N=23)Qualitative methods: focus groupsEase of use; operational usability; satisfaction; usefulness; user-friendliness
23Duarte Filho et al [ ], 2014, BrazilMedicine (N=10)Posttest nonrandomized control group design: questionnaireEase of use; efficiency; satisfaction; usefulness
24Duggan et al [ ], 2020, CanadaMedicine (N=80)Posttest 1-group design: questionnaire; data from appEase of use; frequency of use; satisfaction; usefulness
25Fernandez-Lao et al [ ], 2016, SpainPhysiotherapy (N=49)Randomized controlled trial: questionnaire ; task and knowledge performanceLearning performance; satisfaction
26Fralick et al [ ], 2017, CanadaMedicine (N=62)Pre-post test, nonrandomized control group design: questionnaireEase of use; frequency of use; learning performance; usefulness
27Ghafari et al [ ], 2020, IranNursing (N=8)Posttest 1-group design: questionnaireEase of use; operational usability; satisfaction; usefulness
28Goldberg et al [ ], 2014, United StatesMedicine (N=18)Posttest 1-group design: questionnaire; task and knowledge performanceEase of use; effectiveness
29Gutiérrez-Puertas et al [ ], 2021, SpainNursing (N=184)Randomized controlled trial: questionnaire; task and knowledge performanceLearning performance; satisfaction
30Herbert et al [ ], 2021, United StatesNursing (N=33)Randomized controlled trial: questionnaire; task and knowledge performanceEase of use; learning performance; navigation; operational usability; usefulness
31Hsu et al [ ], 2019, TaiwanNursing (N=16)Qualitative methods: interviewsContext of use; operational usability; satisfaction; usefulness
32Huang et al [ ], 2010, TaiwanNot clear (N=28)Posttest 1-group design: questionnaireEase of use; satisfaction, usefulness
33Hughes and Kearney [ ], 2017, United StatesOccupational therapy (N=19)Qualitative methods: focus groupsEfficiency; satisfaction
34Ismail et al [ ], 2018, MalaysiaHealth science (N=124)Pre-post test, 1-group design: questionnaireEase of use; learning performance; satisfaction; user-friendliness
35Johnson et al [ ], 2021, NorwayOccupational therapy, physiotherapy, and social education (N=15)Qualitative methods: focus groupsContext of use; ease of use; operational usability
36AKang Suh [ ], 2018, South KoreaNursing (N=92)Pre-post test, nonrandomized control group design: questionnaire; data from appEffectiveness; frequency of use; learning performance; satisfaction
36BKang Suh [ ], 2018, South KoreaNursing (N=49)Qualitative methods: focus groupsEffectiveness; frequency of use; learning performance; satisfaction
37Keegan et al [ ], 2016, United StatesNursing (N=116)Posttest nonrandomized control group design: questionnaire; task and knowledge performanceLearning performance; satisfaction; usefulness
38Kim-Berman et al [ ], 2019, United StatesDentist (N=93)Posttest 1-group design: questionnaire; task and knowledge performanceContext of use; ease of use; effectiveness; usefulness
39Kojima et al [ ], 2011, JapanPhysiotherapy and occupational therapy (N=41)Pre-post test, 1-group design: questionnaireEase of use; learning performance; satisfaction; usefulness
40Koulias et al [ ], 2012, AustraliaMedicine (N=171)Posttest 1-group design: questionnaireEase of use; operational usability; satisfaction
41Kow et al [ ], 2016, SingaporeMedicine (N=221)Pre-post test, 1-group design: questionnaireLearning performance; satisfaction
42Kurniawan and Witjaksono [ ], 2018, IndonesiaMedicine (N=30)Posttest 1-group design: questionnaireSatisfaction; usefulness
43ALefroy et al [ ], 2017, United KingdomMedicine (N=21)Qualitative methods: focus groups; data from appContext of use; frequency of use; satisfaction
43BLefroy et al [ ], 2017, United KingdomMedicine (N=405)Quantitative methods: data from appContext of use; frequency of use; satisfaction
44Li et al [ ], 2019, TaiwanHealth care (N=70)Pre-post test, nonrandomized control group design: questionnaire Ease of use; usefulness
45Lin and Lin [ ], 2016, TaiwanNursing (N=36)Pre-post test, nonrandomized control group design: questionnaireCognitive load; ease of use; learnability; learning performance; usefulness
46Lone et al [ ], 2019, IrelandDentist (N=59)Randomized controlled trial: questionnaire; task and knowledge performanceEase of use; learnability; learning performance; operational usability; satisfaction
47ALong et al [ ], 2016, United StatesNursing (N=158)Pre-post test, 1-group design: questionnaire; data from appEase of use; efficiency; learnability; learning performance; satisfaction
47BLong et al [ ], 2016, United StatesHealth science (N=159)Randomized controlled trial: questionnaire; data from appEase of use; efficiency; learnability; learning performance; satisfaction
48Longmuir [ ], 2014, United StatesMedicine (N=56)Posttest 1-group design: questionnaire; data from appEfficiency; learnability; operational usability; satisfaction
49López et al [ ], 2016, SpainMedicine (N=67)Posttest 1-group design: questionnaire Context of use; ease of use; errors; satisfaction; usefulness
50Lozano-Lozano et al [ ], 2020, SpainPhysiotherapy (N=110)Randomized controlled trial: questionnaire; task and knowledge performanceLearning performance; satisfaction; usefulness
51Lucas et al [ ], 2019, AustraliaPharmacy (N=39)Pre-post test, 1-group design: questionnaire; task and knowledge performanceSatisfaction; usefulness
52Mathew et al [ ], 2014, CanadaMedicine (N=5)Think-aloud methods: questionnaire ; interviews; task and knowledge performanceLearnability; satisfaction
53McClure [ ], 2019, United StatesNursing (N=16)Posttest 1-group design: questionnaire Learnability; satisfaction; usefulness
54McDonald et al [ ], 2018, CanadaMedicine (N=20)Pre-post test, 1-group design: questionnaire; data from appEffectiveness; satisfaction
55McLean et al [ ], 2014, AustraliaMedicine (N=58)Mixed methods: questionnaire; focus groups; interviewsSatisfaction
56 | McMullan [ ], 2018, United Kingdom | Health science (N=60) | Pre-post test, 1-group design: questionnaire | Learning performance; navigation; satisfaction; usefulness
57 | Mendez-Lopez et al [ ], 2021, Spain | Psychology (N=67) | Pre-post test, 1-group design: questionnaire; task and knowledge performance | Cognitive load; ease of use; learning performance; satisfaction; usefulness
58 | Meruvia-Pastor et al [ ], 2016, Canada | Nursing (N=10) | Pre-post test, 1-group design: questionnaire; task and knowledge performance | Ease of use; learning performance; satisfaction; usefulness
59 | Mettiäinen [ ], 2015, Finland | Nursing (N=121) | Mixed methods: questionnaire; focus groups | Ease of use; usefulness
60 | Milner et al [ ], 2020, United States | Medicine and nursing (N=66) | Posttest 1-group design: questionnaire | Satisfaction; usefulness
61 | Mladenovic et al [ ], 2021, Serbia | Dentistry (N=56) | Posttest 1-group design: questionnaire | Context of use; ease of use; satisfaction; usefulness
62 | Morris and Maynard [ ], 2010, United Kingdom | Physiotherapy and nursing (N=19) | Pre-post test, 1-group design: questionnaire | Context of use; ease of use; navigation; operational usability; usefulness
63A | Nabhani et al [ ], 2020, United Kingdom | Pharmacy (N=56) | Posttest 1-group design: questionnaire | Ease of use; learnability; learning performance; satisfaction; usefulness
63B | Nabhani et al [ ], 2020, United Kingdom | Pharmacy (N=152) | Posttest 1-group design: questionnaire | Ease of use; learnability; learning performance; satisfaction; usefulness
63C | Nabhani et al [ ], 2020, United Kingdom | Pharmacy (N=33) | Posttest 1-group design: task and knowledge performance | Ease of use; learnability; learning performance; satisfaction; usefulness
64A | Noguera et al [ ], 2013, Spain | Physiotherapy (N=84) | Posttest 1-group design: questionnaire | Learning performance; satisfaction; usefulness
64B | Noguera et al [ ], 2013, Spain | Physiotherapy (N=76) | Randomized controlled trial: questionnaire | Learning performance; satisfaction; usefulness
65 | O’Connell et al [ ], 2016, Ireland | Medicine, nursing, and pharmacy (N=89) | Randomized controlled trial: questionnaire | Ease of use; learning performance; operational usability; satisfaction; simplicity
66 | Oliveira et al [ ], 2019, Brazil | Medicine (N=110) | Randomized controlled trial: questionnaire; task and knowledge performance | Frequency of use; learning performance; satisfaction
67 | Orjuela et al [ ], 2015, Colombia | Medicine (N=22) | Posttest 1-group design: questionnaire | Ease of use; satisfaction
68 | Page et al [ ], 2016, United States | Medicine (N=356) | Mixed methods: questionnaire; interviews | Context of use; efficiency; satisfaction
69 | Paradis et al [ ], 2018, Canada | Medicine and nursing (N=108) | Posttest 1-group design: questionnaire | Ease of use; satisfaction; usefulness
70 | Pereira et al [ ], 2017, Brazil | Medicine (N=20) | Posttest 1-group design: questionnaire | Ease of use; learnability; satisfaction; usefulness
71 | Pereira et al [ ], 2019, Brazil | Nursing (N=60) | Posttest 1-group design: questionnaire | Ease of use; operational usability; satisfaction
72A | Pinto et al [ ], 2008, Brazil | Biomedical informatics (N=5) | Qualitative methods: observations; task and knowledge performance | Efficiency; errors; learnability; learning performance; operational usability; satisfaction
72B | Pinto et al [ ], 2008, Brazil | Medicine (N=not clear) | Posttest nonrandomized control group design: questionnaire | Efficiency; errors; learnability; learning performance; operational usability; satisfaction
73 | Quattromani et al [ ], 2018, United States | Nursing (N=181) | Randomized controlled trial: questionnaire | Learnability; learning performance; satisfaction; usefulness
74 | Robertson and Fowler [ ], 2017, United States | Medicine (N=18) | Qualitative methods: focus groups | Satisfaction
75A | Romero et al [ ], 2021, Germany | Medicine (N=22) | Think-aloud methods: questionnaire; interviews; task and knowledge performance | Effectiveness; efficiency; errors; navigation; satisfaction
75B | Romero et al [ ], 2021, Germany | Medicine (N=22) | Posttest 1-group design: questionnaire | Learnability; satisfaction
75C | Romero et al [ ], 2021, Germany | Medicine (N=736) | Posttest 1-group design: questionnaire | Frequency of use; satisfaction
76 | Salem et al [ ], 2020, Australia | Pharmacy (N=33) | Posttest 1-group design: questionnaire | Operational usability; satisfaction; usefulness
77 | San Martín-Rodríguez et al [ ], 2020, Spain | Nursing (N=77) | Posttest 1-group design: questionnaire; task and knowledge performance | Learning performance; operational usability; satisfaction
78 | Schnepp and Rogers [ ], 2017, United States | Not clear (N=72) | Think-aloud methods: questionnaire; interviews; task and knowledge performance | Learnability; satisfaction
79 | Smith et al [ ], 2016, United Kingdom | Medicine and nursing (N=74) | Mixed methods: questionnaire; focus groups | Navigation; operational usability; satisfaction; user-friendliness
80 | Strandell-Laine et al [ ], 2019, Finland | Nursing (N=52) | Mixed methods: questionnaire; written qualitative responses | Learnability; operational usability; satisfaction
81 | Strayer et al [ ], 2010, United States | Medicine (N=122) | Mixed methods: questionnaire; focus groups | Context of use; learnability; learning performance; satisfaction; usefulness
82 | Taylor et al [ ], 2010, United Kingdom | 8 different health care education programs (N=79) | Qualitative methods: focus groups; written qualitative reflections | Context of use; learnability
83 | Toh et al [ ], 2014, Singapore | Pharmacy (N=31) | Posttest 1-group design: questionnaire | Ease of use; learnability; navigation; usefulness
84 | Tsopra et al [ ], 2020, France | Medicine (N=57) | Mixed methods: questionnaire; focus groups | Ease of use; operational usability; satisfaction; usefulness
85 | Wu [ ], 2014, Taiwan | Nursing (N=36) | Mixed methods: questionnaire; interviews | Cognitive load; effectiveness; satisfaction; usefulness
86 | Wyatt et al [ ], 2012, United States | Nursing (N=12) | Qualitative methods: focus groups | Ease of use; efficiency; errors; learnability; memorability; navigation; satisfaction
87 | Yap [ ], 2017, Singapore | Pharmacy (N=123) | Posttest 1-group design: questionnaire | Comprehensibility; learning performance; memorability; navigation; satisfaction; usefulness
88 | Zhang et al [ ], 2015, Singapore | Medicine (N=185) | Mixed methods: questionnaire; focus groups | Usefulness

a Performances measured, comparing paper and app results, quiz results, and exam results.

b Reported use of validated questionnaires.

Usability Evaluation Methods

The usability evaluation methods found were either inquiry-based or based on user testing. The following inquiry methods were used: 1-group design (46/98, 47%), control group design (12/98, 12%), randomized controlled trials (12/98, 12%), mixed methods (12/98, 12%), and qualitative methods (11/98, 11%). Several studies that applied inquiry-based methods used more than one data collection method, with questionnaires being used most often (80/98, 82%), followed by task and knowledge performance testing (17/98, 17%), focus groups (15/98, 15%), collection of user data from the app (10/98, 10%), interviews (5/98, 5%), written qualitative reflections (4/98, 4%), and observations (3/98, 3%). Additional information can be found in the data extraction sheet ( Multimedia Appendix 3 ). Figure 2 illustrates the frequency of the inquiry-based usability evaluation methods and data collection methods.

Figure 2. Inquiry usability evaluation methods and data collection methods.
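
The frequencies above are simple proportions over the 98 studies. As an illustration only, they could be recomputed from a data extraction sheet with a short script such as the one below; the file name and column names are assumptions for this sketch and are not taken from Multimedia Appendix 3.

```python
# Sketch: tally evaluation methods and data collection methods from a data
# extraction sheet. File name and column names are illustrative assumptions.

import csv
from collections import Counter

method_counts = Counter()
data_collection_counts = Counter()

with open("data_extraction_sheet.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # one evaluation method per study
        method_counts[row["evaluation_method"]] += 1
        # several data collection methods per study, assumed semicolon-separated
        for dc in row["data_collection"].split(";"):
            data_collection_counts[dc.strip()] += 1

total = sum(method_counts.values())
for method, n in method_counts.most_common():
    print(f"{method}: {n}/{total} ({n / total:.0%})")
```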

The only user testing methods found were think-aloud methods (5/98, 5%), and 4 (80%) of these studies applied more than one data collection method. The data collection methods used included interviews (4/98, 4%), questionnaires (3/98, 3%), task and knowledge performance (3/98, 3%), focus groups (1/98, 1%), and collection of user data from the app (1/98, 1%).

A total of 19 studies used a psychometrically tested usability questionnaire, including the SUS, the Technology Acceptance Model, the Technology Satisfaction Questionnaire, and the Technology Readiness Index. The SUS [ 112 ] was used most frequently (9/98, 9%).
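
Part of the appeal of psychometrically tested questionnaires such as the SUS is that their scoring is standardized: odd-numbered item responses contribute (response − 1), even-numbered items contribute (5 − response), and the summed contributions are multiplied by 2.5 to yield a 0-100 score. A minimal sketch with hypothetical responses:

```python
# Minimal sketch of standard System Usability Scale (SUS) scoring.
# The ten responses below are hypothetical, on the usual 1-5 scale.

def sus_score(responses):
    """Convert ten 1-5 SUS item responses into a 0-100 SUS score."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS requires ten responses on a 1-5 scale")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # odd-numbered items vs even-numbered items
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # prints 85.0
```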

Field testing was the most frequent type of usability experiment (71/98, 72%). A total of 22 (22%) studies performed laboratory testing, and 5 (5%) did not indicate the type of experiment performed. Multimedia Appendix 3 provides an overview of the type of experiment conducted in each study. The usability testing of the mobile apps took place in a classroom setting (41/98, 42%), in clinical placement (29/98, 30%), during simulation training (14/98, 14%), in another setting (7/98, 7%), or in a setting that was not specified (5/98, 5%).

Usability Attributes

A total of 17 usability attributes were identified in the included studies. The most frequently identified attributes were satisfaction, usefulness, ease of use, learning performance, and learnability. The least frequent were errors, cognitive load, comprehensibility, memorability, and simplicity. Table 3 provides an overview of the usability attributes identified in the included studies.

Table 3. Distribution of usability attributes (n=17) and affiliated reports (N=88).

Usability attribute | Distribution, n (%) | Reports (references)
Satisfaction | 74 (84) | [ - , - , - , - , , , - , , , - , - , , , - , - ]
Usefulness | 51 (58) | [ , , - , - , - , , , - , - , , , , - , , - , , , , , , - , , ]
Ease of use | 45 (51) | [ , , , , , , - , - , - , - , , , , , - , - , , - , - , , , - , , , ]
Learning performance | 33 (38) | [ , - , , , , , , , , , , , , , , - , , - , - , , , , , ]
Learnability | 23 (26) | [ , , , , , , - , , , , , , , , , - , ]
Operational usability | 19 (22) | [ , , , , , , , , , , , , , , - , , ]
Context of use | 14 (16) | [ , , , , , , , , , , , , , ]
Navigation | 12 (14) | [ , - , , , , , , , , ]
Efficiency | 11 (13) | [ , , , , , , , , , , ]
Effectiveness | 10 (11) | [ , - , , , , , , ]
Frequency of use | 10 (11) | [ , , , , , , , , , ]
User-friendliness | 7 (8) | [ , , , , , , ]
Errors | 5 (6) | [ , , , , ]
Cognitive load | 3 (3) | [ , , ]
Comprehensibility | 2 (2) | [ , ]
Memorability | 2 (2) | [ , ]
Simplicity | 2 (2) | [ , ]

Principal Findings

This scoping review sought to identify the usability methods and attributes reported in usability studies of mobile apps for health care education. A total of 88 articles, reporting 98 studies, were included in this review. Our findings indicate a steady increase in publications from 2014 onward, with studies published in 22 different countries. Field testing was used more frequently than laboratory testing. Furthermore, the usability evaluation methods applied were either inquiry-based or based on user testing. Most of the inquiry-based methods were experiments that used questionnaires as a data collection method, and all of the studies with user testing methods applied think-aloud methods. Satisfaction, usefulness, ease of use, learning performance, and learnability were the most frequently identified usability attributes.

Comparison With Prior Work

The studies included in this scoping review mainly applied inquiry-based methods, primarily the collection of self-reported data through questionnaires. This is congruent with the findings of Weichbroth [ 10 ], in which controlled observations and surveys were the most frequently applied methods. Asking users to respond to a usability questionnaire may provide relevant and valuable information. However, among the 83 studies that used questionnaires in our review, only 19 (23%) used a psychometrically tested usability questionnaire; of these, the SUS [ 112 ] was used most frequently. In line with a previous review of usability questionnaires [ 12 ], we recommend using psychometrically tested usability questionnaires to support the advancement of usability science. As questionnaires address only certain usability attributes, mainly learnability, efficiency, and satisfaction [ 12 ], it would also be helpful to include complementary methods, such as interviews or mixed methods designs, and to add open-ended questions when questionnaires are used.

Furthermore, applying usability evaluation methods other than inquiry methods, such as user testing and inspection methods [ 10 ], could be beneficial and lead to more objective measures of app usability. Subjective data are typically collected via self-reported questionnaires, whereas objective data are based on measures such as task completion rates [ 40 ]. For example, in one of the included studies, participants rated the usability of the app as satisfactory on subjective measures, yet they did not use the app [ 75 ]. Another study reported a lack of agreement between subjective and objective data [ 40 ]. These results indicate the importance of not relying solely on subjective measures of usability. Therefore, future usability studies should combine usability evaluation methods that include both subjective and objective usability measures.
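
As a minimal illustration of what such objective measures can look like in practice, the sketch below derives a task completion rate and mean time on task from a hypothetical session log; the data and field names are assumptions for this example, not material from the included studies.

```python
# Sketch: derive objective usability measures (task completion rate, time on
# task) from a logged usability test session. The task log is hypothetical.

from statistics import mean

task_log = [  # one record per participant-task attempt
    {"participant": "P01", "task": "find_drug_dose", "completed": True,  "seconds": 41},
    {"participant": "P02", "task": "find_drug_dose", "completed": False, "seconds": 95},
    {"participant": "P03", "task": "find_drug_dose", "completed": True,  "seconds": 58},
]

completion_rate = mean(1 if rec["completed"] else 0 for rec in task_log)
time_on_task_success = mean(rec["seconds"] for rec in task_log if rec["completed"])

print(f"Completion rate: {completion_rate:.0%}")                         # 67%
print(f"Mean time on task (successful attempts): {time_on_task_success:.1f} s")  # 49.5 s
```

Such measures can then be compared against questionnaire scores to detect the kind of subjective-objective mismatch reported in the studies cited above.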

Our review found that most of the included studies in health care education (71/98, 72%) performed field testing, whereas previous literature suggests that usability experiments in other fields are more often conducted in a laboratory [ 1 , 113 ]. For instance, Kumar and Mohite [ 1 ] found that 73% of the studies included in their review of mobile learning apps used laboratory testing. Mobile apps in health care education have been developed to support students’ learning on campus and during clinical placement, in various settings and on the move. Accordingly, it is especially important to test how the apps are perceived in these specific environments [ 5 ]; hence, field testing is required. However, many usability issues can also be discovered in a laboratory. Particularly in the early phases of app development, testing an app with several participants in a laboratory may be more feasible [ 8 ]. Laboratory testing can provide rapid feedback on usability issues, which can then be addressed before the app is tested in a real-world environment. Therefore, it may be beneficial to conduct small-scale laboratory testing before field testing.

Previous systematic reviews of mobile apps in general identified satisfaction, efficiency, and effectiveness as the most common usability attributes [ 5 , 10 ]. In this review, efficiency and effectiveness were explored to a limited extent, whereas satisfaction, usefulness, and ease of use were the most frequently identified usability attributes. Our results coincide with those of a previous review on the usability of mobile learning apps [ 1 ], possibly because satisfaction, usefulness, and ease of use are of particular importance when examining learning apps.

Learning performance was assessed frequently in the included studies. To ensure that apps are valuable in a given learning context, it is also relevant to test additional usability attributes such as cognitive load [ 9 ]. However, few studies in our review examined cognitive load [ 68 , 80 , 108 ]. Mobile apps are often used in environments with multiple distractions, which may increase cognitive load [ 5 ] and thereby affect learning performance. Testing both learning performance and users’ cognitive load may therefore improve the understanding of an app’s usability.
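
One common instrument for quantifying perceived cognitive load is the NASA-TLX, whose raw (unweighted) variant is simply the mean of six subscale ratings; the included studies may have used other instruments, so the sketch below is illustrative only, with made-up ratings.

```python
# Sketch: Raw NASA-TLX scoring, one common way to quantify perceived workload
# (cognitive load). Instrument choice and ratings are illustrative assumptions.

from statistics import mean

tlx_ratings = {  # each subscale rated 0-100
    "mental_demand": 70,
    "physical_demand": 10,
    "temporal_demand": 55,
    "performance": 30,   # lower rating = better perceived performance
    "effort": 60,
    "frustration": 40,
}

raw_tlx = mean(tlx_ratings.values())
print(f"Raw TLX (unweighted workload): {raw_tlx:.1f} / 100")  # 44.2
```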

We found that several of the included studies did not use terminology from the usability literature to describe which usability attributes they were testing. For instance, studies that tested satisfaction often used phrases such as “likes and dislikes” and “recommend use to others” without specifying that they were testing the usability attribute satisfaction. Specifying which usability attributes are investigated is important when performing a usability study of mobile apps, as this improves transparency and enables comparison across studies. In addition, evaluating a wider range of usability attributes may enable researchers to broaden their perspective on an app’s usability problems and to improve the app more quickly. A reporting guideline that defines and presents the different usability attributes could assist researchers in deciding on and reporting relevant attributes and would therefore benefit those planning and conducting usability studies, a point also supported by the systematic review conducted by Kumar and Mohite [ 1 ].

Future Directions

Combining different usability evaluation methods that incorporate both subjective and objective usability measures can add important complementary perspectives when developing apps. In future studies, it would be advantageous to use psychometrically tested usability questionnaires to support the advancement of usability science. In addition, developers of mobile apps should determine which usability attributes are relevant before conducting usability studies (eg, by registering a protocol). Incorporating these perspectives into the development of a reporting guideline would benefit future usability studies.

Strengths and Limitations

First, the search strategy was designed in collaboration with a research librarian, peer reviewed by another research librarian, and covered 10 databases and other sources. This broad search strategy resulted in a high number of references, which may be associated with a lower level of precision. To ensure the retrieval of all potentially pertinent articles, two of the authors independently screened titles and abstracts, and studies deemed eligible by either author were included for full-text screening.

Second, the full-text evaluation was challenging because the term “usability” has multiple meanings that do not always relate to usability testing. For instance, the term was sometimes used when testing students’ experience of a commercially developed app without any connection to the app’s further development. In addition, many studies did not explicitly state that a mobile app was being investigated, which also made it challenging to decide whether they satisfied the eligibility criteria. Nevertheless, having 2 reviewers read the full-text articles independently and resolve disagreements through consensus-based discussion ensured the inclusion of relevant articles.

Conclusions

This scoping review was performed to provide an overview of the usability methods used and the attributes identified in usability studies of mobile apps in health care education. Experimental designs were commonly used to evaluate usability, and most studies used field testing. Questionnaires were frequently used for data collection, although few studies used psychometrically tested questionnaires. The usability attributes identified most often were satisfaction, usefulness, and ease of use. The results indicate that combining different usability evaluation methods, incorporating both subjective and objective usability measures, and specifying which usability attributes to test are advantageous. The results can support the planning and conduct of future usability studies aimed at advancing learning apps in health care education.

Acknowledgments

The research library at Western Norway University of Applied Sciences provided valuable assistance in developing and performing the search strategy for this scoping review. Gunhild Austrheim, a research librarian, provided substantial guidance in the planning and performance of the database searches. Marianne Nesbjørg Tvedt peer reviewed the search string. Malik Beglerovic also assisted with database searches. The authors would also like to thank Ane Kjellaug Brekke Gjerland for assessing the data extraction sheet.

Abbreviations

PRISMA-ScR: Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews
SUS: System Usability Scale

Multimedia Appendix 1

Multimedia Appendix 2

Multimedia Appendix 3

Authors' Contributions: SGJ, LL, DC, and NRO proposed the idea for this review. SGJ, DC, and NRO contributed to the screening of titles and abstracts, and SGJ and TP decided on eligibility based on full-text examinations. SGJ extracted data from the included studies. SGJ, TP, LL, DC, and NRO contributed to the drafts of the manuscript, and all authors approved the final version for publication.

Conflicts of Interest: None declared.
