Sound Measurement of Patients Reported Outcomes

RESEARCH ARTICLE

  • Satyendra Nath Chakrabartty 1

Indian Ports Association, Indian Statistical Institute, India.

*Corresponding Author: Satyendra Nath Chakrabartty, Indian Ports Association, Indian Statistical Institute, India.

Citation: Satyendra Nath Chakrabartty , Sound Measurement of Patients Reported Outcomes , 1(1). New Healthcare Advancements and Explorations (NHAE) DOI: https://doi.org/10.64347/3066-2591/NHAE.001

Copyright: © (2024), Satyendra Nath Chakrabartty , this is an open-access article distributed under the terms of The Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Received: March 26, 2024 | Accepted: April 26, 2024 | Published: June 11, 2024

Abstract

Patient-reported outcome scales (PROs) that differ in the number of items, in the width of their K-point items (K = 2, 3, 4, 5) and in scoring systems suffer from methodological limitations.

The paper gives a method to convert item scores to continuous, monotonic, equidistant scores that follow a normal distribution, satisfy desired properties and facilitate parametric analysis.

Ordinal item scores are transformed to equidistant scores ($E_i$-scores) by assigning different weights to the levels of different items, followed by standardization and a further transformation to the proposed item scores ($P_i$-scores), irrespective of the length of the scale and the width of the items. The score of the i-th dimension ($D_i$) is the sum of the $P_i$-scores of the items belonging to that dimension, and the scale score ($P$) is the sum of all $D_i$. Normally distributed $D_i$-scores and $P$-scores facilitate meaningful comparison of patients and groups of patients, including assessment of progress or effectiveness of treatment plans; drawing the path of progress across time for better prognostication; testing statistical hypotheses of equality of means using the t-statistic for independent samples or the paired t-statistic for dependent samples (e.g. pre-treatment and post-treatment scores of a group); and finding equivalent scores of PRO-1 and PRO-2, such that the area under the normal curve up to $P^0_{PRO-1}$ equals the area under the normal curve up to $P^0_{PRO-2}$. Methodological novelties include, among others, use of the highest eigenvalue $\lambda_1$ to find factorial validity (FV) reflecting the main factor being measured by the questionnaire; the maximum value of test reliability $\alpha_{PCA}$; and the relationships between $FV_{Z\text{-scores}}$ and $\alpha_{PCA}$ and between $r_{tt(theoretical)}$ and FV.


Keywords: Patient-reported Outcome Scale; Normal distribution; Progress path; Equivalent scores; Factorial validity; Reliability

Introduction

Recent advances in treatment, management and diagnosis of diseases have considerably improved health care by improving patient outcomes and quality of life (QoL) (Jordan & Tchantchaleishvili, 2021; Wang & Jang, 2022; Wang et al. 2022). Improvement of healthcare outcomes requires methodologically sound outcome measures. The need for rigorous assessment of outcomes for patient safety and treatment quality has been highlighted (MacGillivray, 2020). However, methodological issues in the measurement of health outcomes have not been addressed adequately (Streiner and Norman, 2008). Major issues to be resolved include, among others, a scoring system facilitating meaningful aggregation of items and dimensions, finding the distribution of test scores, and parametric statistical analysis, for better evaluation and differentiation among subjects and existing tools (Panagiotakos, 2009).

A number of tools are being used to assess pertinent outcome measures for diagnostic, therapeutic and rehabilitation approaches (Okkersen et al. 2018). Outcome measures used in a clinical set-up could be (i) Patient-reported, using disease-specific or generic questionnaires where the score of an individual is taken as the sum of item scores, in ordinal scale (Kyte et al. 2015); (ii) Performance-based, primarily for physiologic factors, where patients perform a set of movements/tasks and scores are assigned based on either an objective measurement (like time to complete a task, in ratio scale) or a qualitative assessment (like normal or abnormal for a given task); (iii) Observer-reported, completed by parents or caregivers who observe the patient on a daily basis; and (iv) Clinician-reported, completed by a health care professional using clinical judgments and signs. For the same disease, different types of outcome measures may be used. For example, outcome measures of non-curable Myotonic Dystrophy type 1 (DM1) include:

Operator-dependent 5-point ordinal muscle impairment rating scale (MIRS), involving manual muscle testing (MMT) of 11 muscle groups for identification of stages and progression of DM1 and covering five different stages: MIRS-1 (no muscular impairment); MIRS-2 (myotonia, jaw and temporal wasting, facial weakness, neck flexors weakness, ptosis, nasal speech, no distal weakness except isolated digit flexor weakness); MIRS-3 (distal weakness, no proximal weakness except isolated elbow extensor weakness); MIRS-4 (mild to moderate proximal weakness); MIRS-5 (severe proximal weakness) (Mathieu et al. 2001).

Performance-based outcome measures in DM1 are: the Six-Minute Walk Test (6-MWT) (walking capacity over longer distances); the 10-meter Walk Test (10-mWT) (walking speed over a short distance); the 30-second chair-stand test (30-sCST) (lower limb strength and dynamic balance); the Nine-Hole Peg Test (9-HPT) (upper extremity function, specifically fine dexterity and coordination), etc. (Gagnon et al. 2015).

To evaluate characteristics of gait alterations in ambulant patients, Saggio et al. (2021) suggested two types of severity index, viz. SI-1 and SI-2, based on a plot of elapsed times in both plantar-flexion (PI) (negative angles) and dorsi-flexion (DI) (positive angles) on the Y-axis within a “narrow” time interval on the X-axis.

Results of such outcome measures vary. Thus, selection of appropriate outcome measures is critical for better understanding of current status and progress/decline, relapse or development of adverse reaction or a new disease entity (like infection) of patients over time (Hefford et al. 2011). 

From the angle of measurement, an outcome measure in ratio scale with a fixed zero point facilitates systematic addition, subtraction, multiplication, division and parametric statistical analysis. However, outcome measures in ordinal scales containing K-point items (K = 2, 3, 4, 5, …) suffer from methodological limitations. For example, Likert scales assume that the distance between two successive levels of an item is equal, i.e. for a 7-point item, a constant distance between the j-th and (j+1)-th levels ∀ j = 1, 2, 3, 4, 5, 6. Equal psychological distance between levels will provide exact measurements of the psychological trait being assessed (Wakita et al. 2012). Arithmetic averages requiring equidistant scores are not meaningful for ordinal item scores (Jamieson, 2004) and $\bar{X}$ > or

Distributions of scores of items, dimensions and tests are different and skewed. For two variables X and Y, X + Y = Z is meaningful if X and Y follow similar probability distributions and the distribution of Z is known for further operations. Thus, it is necessary to know the probability density functions (pdf) of X and Y and their convolution.

PCA, FA, t-test, paired t-test, F-test, etc., assume normal distribution of the variables under study.  Results may go wrong if assumptions of the techniques are violated. Outcome scores emerging from questionnaires do not satisfy the normality assumption. 

A high correlation may not imply linearity between X and Y. Chakrabartty (2020) gave an example of correlations exceeding 0.9 even though each of the variables concerned is a non-linear function of X; the assumptions of linear regression on X were not satisfied, since the error scores did not follow a normal distribution. One possible solution to the above problem areas is to transform item scores so that they follow a similar distribution, say the normal distribution.

The paper gives a multi-stage method to convert item scores to continuous, monotonic, equidistant scores, followed by standardization and a further linear transformation to ensure a fixed score range from 1 to 100 and normality. The scale score is taken as the sum of all normally distributed item scores.

Literature survey

Attempts have been made to transform K-point scales to L-point scales where K < or>

Scoring of PROs involves different methods to obtain dimension/scale scores. While the dimension score of the MacNew Heart Disease Health-Related Quality of Life Questionnaire (MacNew) is taken as the arithmetic average of the responses in that dimension, Cardiovascular Limitations and Symptoms Profile (CLASP) scores are weighted to provide a total for each subscale. Each dimension of the Myocardial Infarction Dimensional Assessment Scale (MIDAS) is scored separately. Such dimension scores create difficulties in computing the mean, SD and distribution of scale scores for meaningful comparisons, ranking, classification of individuals and statistical inference.

Cronbach alpha for test reliability assumes that each item measures a single latent trait on the same scale. PROs involving multiple factors violate this assumption and thus Cronbach alpha may underestimate the reliability of a PRO (Daniel, 1990). The coefficient alpha is influenced by variance sources, unknown direction of sampling errors (Terry & Kelley, 2012), sample size (Charter, 1999) and even the number of items (Luh, 2024). Moreover, Friedman’s nonparametric tests cannot quantify interaction effects (Luepsen, 2017). Aligned Rank Transform (ART), a non-parametric factorial ANOVA, analyzes the interaction and also the main effects by aligning the data for each effect (main or interaction), followed by assignment of ranks. Alignment works best for completely randomized designs (King et al. 2003).

The list of PRO scales is too long to enumerate. The Australian Commission on Safety and Quality in Health Care (2016) reviewed Patient-Reported Outcome measures (www.safetyandquality.gov.au).

PROs vary in terms of number of items (length) and number of levels (width) as can be seen from the illustrative scales for insomnia given below: 

- Insomnia Severity Index (ISI): Consists of seven 5-point items marked 0 to 4. Individuals with score ≤ 14 are taken as normal and those scoring > 14 are considered as having insomnia (Chahoud et al. 2017).

- Pittsburgh Sleep Quality Index (PSQI): 19 items in total, where the first four items are open and each of the remaining items is on a 4-point scale from 0 to 3 (Buysse et al. 1989). A score > 5 implies poor sleep quality, and a higher score implies worse sleep quality.

- Insomnia Symptom Questionnaire (ISQ): 13 items. Items 1–5 are 6-point items from 0 to 5 and Items 6–13 are on a 5-point scale from 0 to 4 (Okun et al. 2009).

The following major problem areas may be noted:

  • Each of ISI, PSQI, and ISQ generates ordinal scores whose distributions are unknown. Without meaningful addition of item scores to get dimension scores and scale scores, many desired properties are not satisfied.
  • Different lengths and widths of ISI, PSQI, and ISQ result in different contributions of the dimensions covered by the scales. The mean and variance of PSQI with 19 items exceed those of ISI and ISQ.
  • Psychometric properties of multidimensional ISI, PSQI, and ISQ are different. Assumptions of Cronbach alpha are violated by scales measuring more than one factor. Validity as correlation between a multidimensional scale score and criterion scores is the validity of which dimension/factor? Can we have validity of a scale for the main factor for which the scale was developed? Is it possible to have a relationship between test reliability and test validity?
  • Use of zero as an anchor value does not help to define expected values (value of the variable × probability of that value) of level-wise scores and unnecessarily reduces the mean and variance of the scale; item-total correlations, regression or logistic regression may be inappropriate due to the presence of many zeroes. If each respondent of a sub-group selects the level marked “0” for an item, then computation of between-group variance will be difficult, since mean = variance = 0 for the sub-group and correlation with that item is undefined. Stucki et al. (1995) found that more than 40% of the patients scored zero in 10 subscales of the Sickness Impact Profile (SIP) and in one subclass of SF-36. It is better to mark the anchor values as 1, 2, 3, … and so on, keeping the convention of higher score ⇔ higher value of the variable being measured.
  • A higher score in each of Nottingham Health Profile (NHP) and Minnesota Living with Heart Failure (MLHF) indicates more health problems, unlike the Sickness Impact Profile (SIP). Thus, directions of scores are different for different scales.
  • Different PROs suggest different cut-off scores. For example, the cut-off score of the Stroke-Adapted Sickness Impact Profile (SA-SIP30) with 30 items covering 8 subscales is > 33, and that of the Sickness Impact Profile (SIP136) with 136 “Yes–No” type items distributed over 12 domains is > 22. The question arises whether a score of 33 in SA-SIP30 is equivalent to a score of 22 in SIP136. Similarly, a score of 14 in ISI indicating “no insomnia” is equivalent to which score in PSQI or ISQ? Such questions highlight the need to compare PROs, with special emphasis on finding equivalent scores of two scales for the purposes of diagnosis and classification of individuals. Silva et al. (2014) observed that comparison of cut-off points of PROs or QoL questionnaires is not possible and suggested further investigations on different cut-off points for better comparisons. Based on treatment status, four different cut-off scores were found for the Cancer Core Questionnaire (EORTC QLQ-C30) (Lidington et al. 2022).

3. Suggested Remedial action:       

3.1: Pre – adjustment of data: 

i) Ensure that each item is positively related to the intensity of the trait in question, i.e. the higher the item score, the higher the intensity of the disease or the dimension. For variables like platelet count, WBC count, % myeloid cells in peripheral blood, etc., where a lower value indicates higher risk of cancer, reciprocals of such variables are taken. For a variable like basophils (a type of white blood cell), where a single value is given as the reference instead of a range, an agreed particular value may be taken as the standard.

ii) Assign 1, 2, 3, 4, 5, … to the levels or response categories of items, avoiding zero.
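
The two pre-adjustment steps can be sketched as follows. This is a minimal illustration only: the function names and the sample values are hypothetical and are not taken from the paper.

```python
def shift_levels(raw, lowest=0):
    """Recode levels so that the lowest category is scored 1 instead of `lowest`."""
    return [x - lowest + 1 for x in raw]

def reverse_direction(values):
    """Take reciprocals so that a higher value indicates higher intensity."""
    return [1.0 / v for v in values]

item_0_to_4 = [0, 3, 4, 1, 0, 2]        # hypothetical responses on a 0-4 item
print(shift_levels(item_0_to_4))        # levels now run from 1 to 5

lower_is_worse = [150.0, 300.0, 75.0]   # hypothetical values, lower = higher risk
print(reverse_direction(lower_is_worse))
```

After the reciprocal step, the smallest original value receives the largest adjusted value, so all variables point in the same direction.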

3.2 Converting ordinal score: 

Let $X_{ij}$ be the raw score of the i-th patient in the j-th item; $X_{ij}$ takes the discrete values 1, 2, 3, 4 and 5 for a 5-point item. Let the frequency of each level of an item be obtained from the data. Ordinal $X_{ij}$ can be converted to normally distributed, continuous, monotonic, equidistant scores by the following stages:

Stage I. Equidistant scores: 

For a 5-point item, find weights $W_1, W_2, W_3, W_4, W_5$ for the five levels of the item. The equidistant property and the monotonic condition will be satisfied if the weighted level scores $W_1, 2W_2, 3W_3, 4W_4, 5W_5$ form an arithmetic progression with a positive common difference. Two ways to find such weights are as follows:

Method 1: The procedure for obtaining the weights of an item considering the area under N(0, 1) is illustrated in Table 1.

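Table 1: Calculation of weights based on area under N(0, 1)

Response category | Proportion | Cumulative proportion | Area under the standard normal curve | Initial weights

Rows: response categories 1 to 5 (the area for category 2 is denoted A2; the cumulative proportion reaches 1.00 at category 5), followed by a Total row (1.00).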
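
The Stage I condition, that the weighted level scores form an arithmetic progression with a positive common difference, can be checked with a short sketch. It assumes that the weighted score of level j is j × W_j; the first term `first` and common difference `step` are hypothetical inputs chosen for illustration, whereas the methods of the paper derive the weights from the data.

```python
def ap_weights(k, first, step):
    """Weights W_1..W_k such that j * W_j = first + (j - 1) * step."""
    if step <= 0:
        raise ValueError("common difference must be positive")
    return [(first + (j - 1) * step) / j for j in range(1, k + 1)]

def is_equidistant(weights, tol=1e-9):
    """True if j * W_j forms an arithmetic progression with positive difference."""
    scores = [j * w for j, w in enumerate(weights, start=1)]
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    return all(d > 0 for d in diffs) and max(diffs) - min(diffs) < tol

w = ap_weights(5, first=0.2, step=0.5)   # hypothetical first term and difference
print([round(j * wj, 2) for j, wj in enumerate(w, start=1)])
print(is_equidistant(w))                 # weighted levels are equally spaced
print(is_equidistant([1, 1, 1, 1, 2]))   # unequal spacing between levels 4 and 5
```

Unit weights reproduce the usual Likert assumption of equal distances; the point of data-driven weights is to obtain equidistance without assuming it.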

$$P = \left[\frac{99\,\left(Z_{ij} - \min(Z_{ij})\right)}{\max(Z_{ij}) - \min(Z_{ij})}\right] + 1 \qquad (1)$$

Parameters of the distribution of the i-th item, namely its mean and SD, can be estimated from the data. Item-wise P-scores as per (1) are applicable irrespective of the length of the scale and the width of the items. Thus, all items have the same score range.
Dimension score is taken as the sum of the normally distributed P-scores of the items contained in the dimension; it follows a normal distribution whose mean and SD can likewise be estimated from the data. The scale score is the sum of the dimension scores (or item scores), each following a normal distribution.
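
A minimal sketch of the standardization and the rescaling in (1), followed by summation into a dimension score. The E-scores below are hypothetical, and the sketch does not test normality; it only shows that every item ends up in the same range from 1 to 100.

```python
from statistics import mean, pstdev

def p_scores(e_scores):
    """Standardize E-scores to Z-scores, then rescale to [1, 100] as in (1)."""
    mu, sd = mean(e_scores), pstdev(e_scores)
    if sd == 0:
        raise ValueError("all E-scores are equal; Z-scores are undefined")
    z = [(e - mu) / sd for e in e_scores]
    z_min, z_max = min(z), max(z)
    return [99 * (zi - z_min) / (z_max - z_min) + 1 for zi in z]

# Hypothetical E-scores of six patients on two items of one dimension.
item_a = [0.7, 1.2, 1.2, 1.7, 2.2, 0.7]
item_b = [0.5, 0.9, 1.3, 1.3, 2.1, 1.7]

p_a, p_b = p_scores(item_a), p_scores(item_b)
dimension = [a + b for a, b in zip(p_a, p_b)]  # dimension score = sum of item P-scores
print([round(p, 1) for p in p_a])
print([round(d, 1) for d in dimension])
```

Because the rescaling is a ratio of differences, the choice between population and sample SD in the standardization step does not change the resulting P-scores.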

Properties: 

Continuous, equidistant and monotonic E-scores obtained by assigning different weights to the levels of different items by method-1 and method-2 are highly correlated. However, method-2 avoiding Standard Normal Table appears to be straightforward.

 can be taken as zero value for scoring K-point items as weighted sum. 

Equal importance to items and dimensions is avoided by item-wise E-scores and scale scores (P-scores). Normality ensures meaningful admissibility of arithmetic aggregation.

P-scores offer practically zero tied scores and thus, can better discriminate the respondents with tied raw scores and assign unique ranks to individuals and facilitate parametric analysis. 

For items in ratio scales, transformation to E-scores is not required; such scores can be standardized and transformed to follow a normal distribution in the score range [1, 100].

3.3 Benefits of P-scores:

Benefits of proposed scores:

Dimension scores ($D_i$) and proposed scale scores ($P$) are continuous, monotonic and normal, and enable parametric analysis including estimation of the population mean ($\mu$), the population variance ($\sigma^2$) and the confidence interval of $\mu$, and testing of statistical hypotheses like $H_0: \mu_1 = \mu_2$ or $H_0: \sigma_1^2 = \sigma_2^2$, for snap-shot data and also for longitudinal data.

Evaluate the progress of the i-th patient in time period t over the previous period by $\frac{P_{i(t)} - P_{i(t-1)}}{P_{i(t-1)}} \times 100$. Decline is indicated if $P_{i(t)} - P_{i(t-1)} < 0$. For a group of patients, $\overline{P_{i(t)}} > \overline{P_{i(t-1)}}$ indicates progress. Normally distributed $P_i$ satisfying the assumptions of the t-test and the paired t-test helps to test $H_0: \mu_{P_t} = \mu_{P_{(t-1)}}$ and also $H_0: \text{Progress}_{(t+1)\ \text{over}\ t} = 0$, reflecting effectiveness of the treatment plans. Decline, if any, may be probed to find the critical dimension(s) where $D_{i(t)} - D_{i(t-1)} < 0$.
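
The progress measure and the paired t-statistic can be sketched as below, assuming P-scores of the same patients at two time points. The scores are hypothetical; the resulting statistic would be compared with a t-table value at n − 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def percent_progress(p_now, p_prev):
    """Percentage change of a P-score over the previous period."""
    return (p_now - p_prev) / p_prev * 100

def paired_t(after, before):
    """Paired t-statistic and degrees of freedom for H0: mean difference = 0."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

before = [40.0, 55.0, 62.0, 48.0, 70.0]   # hypothetical P-scores at t-1
after = [46.0, 54.0, 70.0, 55.0, 73.0]    # hypothetical P-scores at t

progress = [percent_progress(a, b) for a, b in zip(after, before)]
declined = [i for i, g in enumerate(progress) if g < 0]  # patients to probe further
t_stat, df = paired_t(after, before)
print([round(g, 1) for g in progress], declined)
print(round(t_stat, 3), df)
```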

Graph depicting progress/decline of a patient or a sample of patients at various time points is analogous to hazard function and can be used to compare response to treatments from the start.  Such trajectories can help to identify high-risk groups.

Effect of small change in D_i to scale score (P) can be expressed by percentage change of P due to small change in D_i i.e. elasticity indicating relative importance of the dimensions. The dimensions can be ranked in terms of elasticity.
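
Under the additive definition P = ΣD_i, the elasticity of P with respect to D_i reduces to D_i / P, since a unit change in D_i changes P by one unit. The sketch below relies on that reduction; the dimension names and scores are hypothetical.

```python
def elasticities(dimension_scores):
    """Elasticity of the scale score P = sum(D_i) with respect to each D_i."""
    total = sum(dimension_scores.values())
    return {name: d / total for name, d in dimension_scores.items()}

scores = {"physical": 180.0, "emotional": 120.0, "social": 100.0}  # hypothetical D_i
e = elasticities(scores)
ranking = sorted(e, key=e.get, reverse=True)  # most influential dimension first
print(e)
print(ranking)
```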

For two scales X with normal pdf $f(x)$ and Y with normal pdf $g(y)$, one can find a regression equation of the form $Y = \alpha_1 + \beta_1 X$ to predict Y from X, or $X = \alpha_2 + \beta_2 Y$ to predict X from Y. However, the two regression lines differ and thus the empirical relationship between X and Y will not be unique. For a given value, say $x_0$, it is better to find the equivalent score combination $(x_0, y_0)$ of the two scales by solving the equation

$$\int_{-\infty}^{x_0} f(x)\,dx = \int_{-\infty}^{y_0} g(y)\,dy \qquad (2)$$

This avoids the problems of linear equating or percentile equating. The equation (2) can be solved using standard normal table (Chakrabartty, 2021). The method of finding equivalent score-combinations is possible even if the scales have different length, width and dimensions.
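
Equation (2) can also be solved numerically instead of with a table. The sketch below uses `statistics.NormalDist` from the Python standard library; the means and SDs of the two scales are hypothetical.

```python
from statistics import NormalDist

def equivalent_score(x0, scale_x, scale_y):
    """Score y0 on scale Y whose cumulative area equals that of x0 on scale X."""
    return scale_y.inv_cdf(scale_x.cdf(x0))

pro_1 = NormalDist(mu=350, sigma=40)    # hypothetical parameters of PRO-1
pro_2 = NormalDist(mu=900, sigma=120)   # hypothetical parameters of PRO-2

y0 = equivalent_score(390, pro_1, pro_2)
print(round(y0, 2))
```

For two normal distributions, the same result follows in closed form from equal standardized scores: y0 = μ_Y + σ_Y (x0 − μ_X) / σ_X.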

Normally distributed scores satisfy the assumptions of PCA and FA and enable finding Factorial Validity $FV = \frac{\lambda_1}{\sum \lambda_i} = \frac{\lambda_1}{\sum S_{X_i}^2}$, where $\lambda_1$ is the highest eigenvalue. FV reflects the main factor being measured by the questionnaire (Parkerson et al. 2013). Item validity can be computed as the correlation of the item with the principal component. Here, the sum of item validities ≠ scale validity. An eigenvalue ≈ 0 indicates existence of multicollinearity among the items. Significance of the largest eigenvalue can be tested by the Tracy–Widom (TW) test statistic $U = \frac{\lambda_1}{\sum \lambda_i}$, which follows a TW-distribution, i.e. the distribution of the normalized $\lambda_1$ of a Hermitian matrix (Nadler, 2011). Such FV avoids the shortcomings of construct validity, the selection of a criterion scale with matching constructs, and the administration of both the scale and the criterion scale.
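
A minimal sketch of computing FV from item scores, using only the standard library. Power iteration is used here as a simple stand-in for a full eigen-decomposition, and the sum of eigenvalues is taken as the trace of the covariance matrix; the item scores are hypothetical.

```python
from statistics import mean

def covariance_matrix(columns):
    """Sample covariance matrix of items given as equal-length score columns."""
    n = len(columns[0])
    means = [mean(c) for c in columns]
    return [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
             for cb, mb in zip(columns, means)]
            for ca, ma in zip(columns, means)]

def largest_eigenvalue(matrix, iterations=1000):
    """Power iteration with a Rayleigh quotient, for a symmetric PSD matrix."""
    m = len(matrix)
    v = [1.0] * m  # assumes the start vector is not orthogonal to the main factor
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(m)) for row in matrix]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    av = [sum(row[j] * v[j] for j in range(m)) for row in matrix]
    return sum(a * b for a, b in zip(av, v))

def factorial_validity(matrix):
    """FV = largest eigenvalue / sum of eigenvalues (the trace)."""
    trace = sum(matrix[i][i] for i in range(len(matrix)))
    return largest_eigenvalue(matrix) / trace

items = [[20.0, 35.0, 50.0, 65.0, 80.0],   # hypothetical P-scores, one list per item
         [25.0, 30.0, 55.0, 60.0, 85.0],
         [40.0, 20.0, 60.0, 45.0, 70.0]]
print(round(factorial_validity(covariance_matrix(items)), 3))
```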

For standardized item scores, $FV_{Z\text{-scores}}$ of a test is $\frac{\lambda_1}{m}$, and the test variance $S_X^2$ can be written as

$$S_X^2 = \sum \lambda_i + 2\sum_{i \neq j = 1}^{m} Cov(X_i, X_j) = \frac{\lambda_1}{FV} + 2\sum_{i \neq j = 1}^{m} Cov(X_i, X_j) \qquad (3)$$

Thus, theoretical reliability

$$r_{tt(theoretical)} = \frac{S_T^2}{S_X^2} = \frac{S_T^2}{\frac{\lambda_1}{FV} + 2\sum_{i \neq j = 1}^{m} Cov(X_i, X_j)} \qquad (4)$$

Equation (4) gives a non-linear relationship between $r_{tt(theoretical)}$ and factorial validity.

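The maximum value of test reliability ($\alpha_{PCA}$) as a function of $\lambda_1$, derived from the correlation matrix of m items, was given by Ten Berge and Hofstee (1999) as

$$\alpha_{PCA} = \left(\frac{m}{m-1}\right)\left(1 - \frac{1}{\lambda_1}\right) \qquad (5)$$

Using equation (5), the relationship between FV and $\alpha_{PCA}$ is:

$$\alpha_{PCA} = \left(\frac{m}{m-1}\right)\left(1 - \frac{1}{\lambda_1}\right) = \left(\frac{m}{m-1}\right)\left(1 - \frac{1}{FV \cdot \sum \lambda_i}\right) = \left(\frac{m}{m-1}\right)\left(1 - \frac{1}{m \cdot FV_{Z\text{-scores}}}\right) \qquad (6)$$

As per (6), a higher value of $FV_{Z\text{-scores}}$ increases $\alpha_{PCA}$.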
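
Equations (5) and (6) can be evaluated directly, as in the sketch below. The number of items and the eigenvalue are hypothetical.

```python
def alpha_pca(m, lambda_1):
    """Equation (5): maximum reliability from the largest eigenvalue."""
    return (m / (m - 1)) * (1 - 1 / lambda_1)

def alpha_pca_from_fv(m, fv_z):
    """Equation (6): the same quantity written in terms of FV of Z-scores."""
    return (m / (m - 1)) * (1 - 1 / (m * fv_z))

m, lambda_1 = 10, 4.0      # hypothetical: 10 items, largest eigenvalue 4
fv_z = lambda_1 / m        # FV for standardized item scores
print(round(alpha_pca(m, lambda_1), 4))
print(round(alpha_pca_from_fv(m, fv_z), 4))
```

Both functions return the same value for matching inputs, and the second increases with FV, in line with the statement following (6).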

Normality helps to estimate the variance of each item, dimension and questionnaire, enabling estimation of Cronbach alpha for a dimension at population level as

$$\hat{\alpha} = \left(\frac{n}{n-1}\right)\left(1 - \frac{\text{Sum of estimates of variance of items in the dimension}}{\text{Estimate of variance of the dimension}}\right) \qquad (7)$$

Cronbach alpha of a battery consisting of K dimensions can be obtained as a function of dimension reliabilities by

$$\hat{\alpha}_{Battery} = \frac{\sum_{i=1}^{K} r_{tt(i)} S_{X_i} + \sum_{i=1, i \neq j}^{K} \sum_{j=1}^{K} 2\,Cov(X_i, X_j)}{\sum_{i=1}^{K} S_{X_i} + \sum_{i=1, i \neq j}^{K} \sum_{j=1}^{K} 2\,Cov(X_i, X_j)} \qquad (8)$$

where $r_{tt(i)}$ and $S_{X_i}$ denote respectively the reliability and SD of the i-th dimension.

Population estimates for a dimension and a battery by (7) and (8) respectively are simple and avoid the complex methods of Heo et al. (2015), which assume parallel measures and involve estimation of the unbiased sample covariance matrix and the variance-covariance matrix of the population.
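
Equation (7) can be sketched as below, assuming that n in (7) denotes the number of items in the dimension and that variances are estimated by the usual sample variance. The item scores are hypothetical.

```python
from statistics import variance

def cronbach_alpha(items):
    """Equation (7) for one dimension; `items` holds one score list per item."""
    n = len(items)                               # assumed: number of items
    totals = [sum(row) for row in zip(*items)]   # dimension score per patient
    item_var = sum(variance(col) for col in items)
    return (n / (n - 1)) * (1 - item_var / variance(totals))

dimension_items = [[20.0, 35.0, 50.0, 65.0, 80.0],   # hypothetical P-scores
                   [25.0, 30.0, 55.0, 60.0, 85.0],
                   [40.0, 20.0, 60.0, 45.0, 70.0]]
print(round(cronbach_alpha(dimension_items), 3))
```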

Discussion

The proposed method of transforming ordinal item scores to follow a normal distribution ensures admissibility of the operation “addition”. The sum of normally distributed scores of all items belonging to the i-th dimension is taken as the dimension score ($D_i$), and the scale score ($P$) is the sum of the scores of all dimensions (or, equivalently, the sum of the scores of all items). Each $D_i$ and $P$ follows a normal distribution even if the items differ in length and width. Normally distributed P-scores with data-driven estimates of the parameters facilitate meaningful comparison of patients and groups of patients, including assessment of progress or effectiveness of treatment plans; drawing the path of progress across time for useful conclusions and better prognostication; and testing the statistical hypothesis $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$ using the t-statistic for independent samples or the paired t-statistic for dependent samples (e.g. pre-treatment and post-treatment scores of a group). Equivalent score combinations $(P^0_{PRO-1}, P^0_{PRO-2})$ of two PRO scales (and equivalent class boundaries, when individuals are classified by each of the two scales) can be found from

$$\int_{-\infty}^{P^0_{PRO-1}} f(x)\,dx = \int_{-\infty}^{P^0_{PRO-2}} g(y)\,dy$$

i.e. the area under the normal curve corresponding to $f(x)$ up to $P^0_{PRO-1}$ equals the area under the normal curve corresponding to $g(y)$ up to $P^0_{PRO-2}$. Such equivalent cut-off scores also satisfy

$$\frac{\text{Variance of the group with score} \geq P^0_{PRO-1}}{\text{Variance of PRO-1}} = \frac{\text{Variance of the group with score} \geq P^0_{PRO-2}}{\text{Variance of PRO-2}}$$

and can be used to evaluate efficiency of classification, say in terms of within-group variance and between-group variance.

Methodological novelties include, among others, use of the highest eigenvalue $\lambda_1$ to find factorial validity (FV) reflecting the main factor being measured by the questionnaire; the maximum value of test reliability $\alpha_{PCA}$ as a function of $\lambda_1$; and the relationship between $FV_{Z\text{-scores}}$ and $\alpha_{PCA}$ as well as the non-linear relationship between $r_{tt(theoretical)}$ and FV. In addition, normally distributed $D_i$ scores with estimated parameters help to find the population estimate of Cronbach alpha for a dimension and Cronbach alpha of a battery consisting of K dimensions.

Conclusion

The methodologically sound approach given in the paper, with wide application areas, helps significantly in better assessment of outcomes and in comparison of subjects and PROs, along with measures of psychometric properties of tests like reliability and validity and their relationships, including the derived relationship between reliability and validity, each as a function of the largest eigenvalue. Such relationships may be used to find the optimal value of one psychometric parameter to maximize another parameter. Future studies may explore such potential with empirical investigations, extension of factorial validity to a battery of tests, and construction of a psychometric quality index of a test and a battery, in addition to empirical verification of the properties of the proposed methods using real-life data.

Declarations

Acknowledgement: Nil

Conflict of interests: Nil

Funding: The author did not receive any grant from funding agencies in the public, commercial, or not-for-profit sectors.

Informed Consent: Not applicable. The paper did not collect data from human participants.

Data availability statement: The paper did not analyse or generate any datasets, because the work proceeds within a theoretical and mathematical approach

CRediT statement: Conceptualization; Methodology; Analysis; Writing and editing the paper by the Sole Author

References