Virtual Fall Summit 2021

October 26 & 27, 2021

Summit Schedule

LODS logo

The Performance Testing Council is supported in part by our Cornerstone Members:

This Summit is sponsored by:

PSI logo

Content Generation for Digital-first Assessments: A 21st Century Approach

Alina A. von Davier, PhD

As technology and psychometrics have advanced, it has become apparent that quality content development is at the core of the education industry, for both learning and assessment. Developing learning and assessment content has been a craft requiring a high level of expertise, often of the type built over years on the job. In fast-paced digital education and large-scale digital-first assessments, this is difficult to sustain.

In this presentation I will describe an alternative approach to developing an assessment, based on a) creating a large item bank using language-model-based automatic item generation techniques, b) estimating the items' preliminary difficulties using natural language processing (NLP) models, and c) piloting the items in the context of an adaptive test and a framework for updating item parameters.

I will show how a psychometric framework combined with ML algorithms can support quality assessments for the 21st century. I will illustrate this approach with the Duolingo English Test. 

Alina A. von Davier, PhD

Chief of Assessment, Duolingo

Alina von Davier, PhD, is Chief of Assessment at Duolingo and Founder & CEO of EdAstra Tech LLC. At Duolingo, von Davier and her team operate at the forefront of Computational Psychometrics. Her current research interests involve developing psychometric methodologies in support of digital-first assessments, such as the Duolingo English Test, using techniques incorporating machine learning, data mining, Bayesian inference methods, and stochastic processes.

Two of her publications, a co-edited volume on Computerized Multistage Testing (2014) and an edited volume on test equating, Statistical Models for Test Equating, Scaling, and Linking (2011), were selected as winners of the Division D Significant Contribution to Educational Measurement and Research Methodology award from the American Educational Research Association (AERA). Additionally, she has written and/or co-edited five other books and volumes on statistics and psychometric topics. In 2020, von Davier was awarded a Career Award from the Association of Test Publishers (ATP). In 2019 she was a Finalist for the Visionary Award of EdTech Digest.

Prior to Duolingo she was a Chief Officer at ACT, where she led ACTNext, a large R&D-innovation unit. Before that, von Davier was a Senior Research Director at Educational Testing Service (ETS), where she led the Computational Psychometrics Research Center. Prior to that, she led the Center for Psychometrics for International Tests, where she was responsible for the psychometrics in support of international tests, TOEFL® and TOEIC®, and for the scores reported to millions of test takers annually.

Von Davier is currently the president of the International Association of Computerized Adaptive Testing (IACAT), and she serves on the board of directors for the Association of Test Publishers (ATP). She is a mentor with New England Innovation Network, Harvard Innovation Labs, and with the Programme via:mento at the University of Kiel, Germany.

She earned her doctorate in mathematics from Otto von Guericke University of Magdeburg, Germany, and her master of science degree in mathematics from the University of Bucharest, Romania.

Item Weighting and Exam Scoring of a Hybrid Performance-Based CBT

Mark Stevens

In 2019, SAS converted its two most popular exams to performance-based testing (PBT). This followed several years of work with Pearson VUE and ITS to solve the technical challenges faced when deploying an exam with an embedded virtual PC to a global audience. In addition to the technical challenges, we also needed to redesign our scoring methods to allow for overweighting of the PBT items versus the standard items on the exam. We researched the available literature on item weighting to understand the valid reasons to overweight an item. We also implemented our first scaled score, as our previous exams reported simple percentage scores. I'll report on our efforts to redefine our item- and exam-level scoring methods and discuss the exam's psychometric performance.

Key Objectives:

• Explain how SAS automatically scores performance-based exams.

• Discuss the available rationale for overweighting items.
• Explain SAS' approach to scaled scores.
• Share some basic psychometric properties of the exam.

Mark Stevens

Senior Certification Developer, SAS

Mark is a senior certification developer for SAS, an analytics software company based in Cary, North Carolina. He joined the SAS Global Certification team in 2010, having previously worked at Nortel, where he first began working on exam development after a stint as a network engineer and technical instructor. While appreciating standard multiple-choice computer-based exams for IT certification, Mark is most excited about the increased validity of performance-based testing and about solving the technical and logistical challenges associated with scaling computer-based PBT globally. In his spare time, Mark is finally pursuing a credential in credentialing: an MS in Educational Measurement from UNC Greensboro.

Key Principles in Developing a Valid and Reliable OSCE: Stations, Scenarios and Scoring

Nicole Evers

Objective Structured Clinical Examinations (OSCEs) have been a popular means of assessment in the medical profession for several decades. As a performance-based assessment tool, OSCEs have advantages over other methods of assessment such as multiple-choice tests or oral examinations. Well-designed OSCEs offer an effective assessment of task proficiency under controlled conditions. In this session, key principles surrounding the development of reliable, valid, and fair OSCEs will be reviewed. Emphasis will be placed on the practical knowledge required to guide the sound development of OSCEs.

Key Objectives:

• Describe the OSCE: design and key elements.

• Provide an overview of important assessment principles in OSCE development that affect its validity, reliability, and fairness.

• Present practical considerations in the development of stations, scenarios, and scoring.

Nicole Evers

Psychometrician, Meazure Learning

Nicole holds an MSc in Occupational Psychology from the University of Nottingham (UK) and has over a decade of experience working with the Government of Canada as an evaluation specialist and project lead on various job analysis and competency development projects and on the development of assessment strategies and tools. She is trained to assess, understand, and diagnose organizations and the individuals working within them. As a psychometrician at Meazure Learning, Nicole provides psychometric expertise on all aspects of test development activities for licensure and certification programs. Her responsibilities include competency development, blueprinting, item development, test assembly, item and test analysis, and standard setting. Her goal with all clients is to engage and support them through their assessment processes.



VR Certification - Secure, Fair and Equitable

Wallace Judd, PhD

Virtual Reality opens up a whole new world of certification possibilities. A look at the construction of one certification illustrates the issues the designer faced and how psychometrically sound results were obtained. One issue was maintaining security with a short, very memorable test. Another was creating variability while sustaining equality of difficulty for candidates. A third was calculating beta statistics on a sparse matrix of fewer than 100 beta users. The certification is modeled on a single-person shooter game, but the principles can be expanded to multi-player and multi-world scenarios.

Key Objectives:

• Learn how to use Playlists to structure the design of interactions.

• Participants will find out how to use Playlists to create certification security.

• The methodology for generating Playlists of equivalent difficulty will be shown.

• Psychometric explanations of how to construct and verify playlists will be presented in low-level mathematical terms.

Wallace Judd, PhD

President, Authentic Testing Corp


Dr. Judd is a psychometrician who received his undergraduate degree from Princeton, his M.Ed. from Harvard, and his Ph.D. from Stanford University, doing his doctoral dissertation under Lee Cronbach on computerized adaptive testing.

He worked on the Lisa at Apple, was a senior programmer at Xerox PARC, and was Director of Training for Netscape.

As president of ComTrain, Inc., he developed the Judd Test series of ten computerized hands-on performance tests, starting in 1989, and has since developed or consulted on dozens of performance tests for clients, including Kelly Services, Oracle, Honeywell, and many others.

The former Executive Director of the Performance Testing Council, Dr. Judd is Chair of the subcommittee that developed a national standard for Performance Testing, ASTM E2849. He is also a former ANSI assessor for ISO/IEC 17024 and has done extensive research on performance testing standards, implementation, and psychometrics. He has previously consulted with the American College of Surgeons on certification for Robotic Surgery. Author of 20 textbooks and numerous journal articles, he holds two patents and is currently the President of Authentic Testing Corporation, a small performance-testing consultancy. Authentic Testing clients have included 18 major certification bodies, among them Red Hat, Linux, Cloud Foundry, the OpenStack Foundation, and CrossFit.