Sunday, August 19, 2012

Quality of Experience (QoE)

As a reader of this blog you are doubtless familiar with the concept of the Quality of Service (QoS) of a telecommunications service, by which we mean meeting defined levels of a set of measurable network parameters, such as availability, delay, delay variability, and information loss. The precise set of parameters depends on the service type; for example, Bit Error Rate (BER) and Errored Seconds Ratio are important to TDM services, while Bandwidth Profile and Delay percentiles are two of the parameters measured for Ethernet services.

On the other hand, you may be less familiar with the related concept of Quality of Experience (QoE).

QoE is defined as the acceptability of a service, as perceived subjectively by the end-user (see ITU-T E.800, P.10, G.1080, and the ETSI 2010 QoS QoE User Experience Workshop). It too depends on the service being provided, being diminished when the user perceives low voice or video quality, long response times, service outages, information loss, lack of service reliability, or inconsistent behavior. Unfortunately, the end-user may not always distinguish whether QoE degradation is due to a defect in the communications network or in an information processing resource; for example, response time to a database query is partially due to computational resource availability and speed, and partially due to network delays in both directions.

While QoE as defined above is absolute and subjective, for reasons that we will discuss below, it may be measured in comparative and/or objective ways. By absolute QoE we mean the quality perceived by an end-user based solely on the received information, while comparative QoE refers to the somewhat artificial case of an end-user who has access to the non-degraded information. Subjective QoE determination is the perception of a true end-user, while objective QoE means QoE estimated by an algorithm designed to correlate with true user perception.

Telecommunications Service Providers originally earned their income by providing basic connectivity, but now, in the age of free WiFi, Skype, Hotmail, Dropbox, and other free best-effort services, the service provider’s only justification for charging a fee is providing a certain QoE level. When the QoE remains above a certain threshold the service is perceived as good, and the end-user is content. Below that level but above some lower threshold the end-user perceives service degradation, but is able to tolerate it. Below the lower threshold the user becomes frustrated and typically abandons the service; surveys show that a large percentage of users experiencing low QoE desert the service provider without ever complaining to the provider’s customer service department.

Unfortunately, direct measurement of QoE is often difficult, and so for many years guaranteeing QoS levels has served as a proxy for QoE guarantees. The theory is that the QoE for a given application is always a function of the network QoS parameters
                                 QoE = f(application, QoS)
but until recently one could only guess the form of this function. However, it is important to emphasize that QoS does not map to QoE independently of the application. For example, for interactive applications such as voice conversations, low delay is critical while packet loss is relatively insignificant, whereas for others, e.g., progressive download over TCP, the opposite is true.

Unsurprisingly, the first QoE parameter to be directly measured was voice quality, since telephony was for many years the paramount telecommunications service. Telephony service providers promised “toll quality” speech (literally, quality for which they could charge a “toll”), and it was thus natural to specify what that meant. This QoE was quantified using the Mean Opinion Score (MOS), defined in ITU-T Recommendation P.800. MOS is measured by having a number of listeners subjectively score the speech quality on a scale from 1 (bad) to 5 (excellent), and averaging these scores (finding the mean). Many variations are defined, including Absolute Category Rating (ACR), in which the listeners hear only the degraded speech, and a comparative method called Degradation Category Rating (DCR), in which the listeners hear both the original and the degraded speech and compare the two. The comparative method is used because it often returns more accurate results.
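
As a minimal illustration of how ACR scores are aggregated (the listener ratings below are invented for the example), the MOS is simply the arithmetic mean of the individual scores, usually reported together with a confidence interval:

import math
import statistics

def mos(scores):
    """Mean Opinion Score: the arithmetic mean of ACR listener ratings (1..5)."""
    return statistics.mean(scores)

def mos_confidence_interval(scores, z=1.96):
    """Approximate 95% confidence interval around the MOS (normal approximation)."""
    m = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return (m - half_width, m + half_width)

# Hypothetical example: ten listeners rate one degraded speech sample.
scores = [4, 3, 4, 5, 3, 4, 4, 2, 4, 3]
print(mos(scores))                      # 3.6
print(mos_confidence_interval(scores))  # roughly (3.1, 4.1)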

Unfortunately, direct measurement of MOS in this fashion is an expensive and time-consuming task. So the ITU-T looked into ways of defining objective measures that could be automated. The first method developed was called PSQM (ITU-T P.861), and the second PESQ (ITU-T P.862). Both are objective comparative measures, in that they compare the degraded speech with the original telephone-quality speech, using appropriate signal processing (such as computing a logarithmic-scale frequency representation) to model the human auditory perception system. Similarly, PEAQ (ITU-R BS.1387) determines the quality of wideband audio. The particular methods were selected in competitions as those having the highest correlation with human MOS scoring.
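
To give a feel (and no more than that) for what such a comparative measure does, the toy sketch below compares a reference signal and a degraded signal on a logarithmic (dB) spectral scale. It is emphatically not PESQ or PEAQ, which additionally perform time alignment and model loudness, masking, and cognitive effects; it only illustrates the full-reference principle, and all names and numbers in it are invented for the example:

import numpy as np

def log_spectra(signal, frame_len=512, hop=256):
    """Frame the signal and return per-frame magnitude spectra in dB."""
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len, hop)]
    mags = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return 20 * np.log10(mags + 1e-10)   # small floor avoids log(0)

def spectral_distortion(reference, degraded):
    """Mean dB difference between reference and degraded spectra
    (a crude stand-in for a perceptual distance, NOT a standardized measure)."""
    ref, deg = log_spectra(reference), log_spectra(degraded)
    n = min(len(ref), len(deg))          # assumes the signals are already time-aligned
    return np.mean(np.abs(ref[:n] - deg[:n]))

# Hypothetical example: a 440 Hz tone and a noisier copy of it, sampled at 8 kHz.
t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.05 * np.random.randn(len(clean))
print(spectral_distortion(clean, noisy))   # larger value = more distortion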

PSQM, PESQ, and PEAQ are all comparative, and are thus not suitable for estimating quality in operational systems where only the degraded audio is available. This was rectified by ITU-T P.563, a single-ended method for measuring absolute objective speech quality. P.563 determines the un-naturalness of telephone-grade speech sounds and how much non-speech-like noise is present.

Another approach championed by the ITU-T is the E-model (Recommendation G.107). The E-model is a planning tool that predicts a mouth-to-ear “transmission rating factor” R between 0 and 100, with higher values signifying better voice quality. An R value can be uniquely converted to an estimated MOS. The expression for R starts with the basic signal-to-noise ratio and reduces it to account for various impairments, including simultaneous impairments (loudness, quantization noise), delay impairments (delay, echo), and equipment impairments (codec distortion, packet loss). On the other hand, R is increased to compensate for advantageous scenarios such as mobility (cellphone, satellite).
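
In the notation of G.107, this additive structure is R = Ro - Is - Id - Ie,eff + A, where Ro is the basic signal-to-noise ratio, Is, Id, and Ie,eff are the simultaneous, delay, and effective equipment impairment factors, and A is the advantage factor. The sketch below shows that structure together with the standard G.107 conversion from R to an estimated MOS; the impairment values in the example are invented for illustration:

def e_model_r(ro=93.2, i_s=0.0, i_d=0.0, ie_eff=0.0, advantage=0.0):
    """Additive E-model structure: R = Ro - Is - Id - Ie,eff + A.
    The default Ro of 93.2 is the R value obtained with all G.107 defaults."""
    return ro - i_s - i_d - ie_eff + advantage

def r_to_mos(r):
    """R-to-MOS mapping defined in ITU-T G.107."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

# Hypothetical example: a VoIP call with an equipment impairment of 11
# (codec distortion and packet loss) and a delay impairment of 10.
r = e_model_r(i_d=10, ie_eff=11)   # 72.2
print(r, r_to_mos(r))              # roughly 72.2 -> MOS of about 3.7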

Several years before ITU-T’s P.563, ETSI TIPHON (Telecommunications and Internet Protocol Harmonization Over Networks) produced TS 101 329-5 on QoS measurement methodologies. Annex E of that document described VQMON, a single-ended method for estimating the E-model factors for VoIP based on network parameters.

But voice is not the only service for which QoE has been defined. The ITU-R produced BT.500 on the subjective assessment of television quality. It defines MOS-like scores - television sequences are shown to a group of viewers, and their subjective opinions are averaged.

Among the notable ITU-T Recommendations for video quality are:
  • P.910 Subjective video quality assessment methods for multimedia applications 
  • P.911 Subjective audiovisual quality assessment methods for multimedia applications 
  • P.920 Interactive test methods for audiovisual communications 
  • P.930 Principles of a reference impairment system for video 
  • P.931 Multimedia communications delay, synchronization and frame rate measurement 
  • J.143 User requirements for objective perceptual video quality measurements in digital cable television 
  • J.144 Objective perceptual video quality measurement techniques for digital cable television in the presence of a full reference 
  • J.246 Perceptual audiovisual quality measurement techniques for multimedia services over digital cable television networks in the presence of a reduced bandwidth reference 
  • J.247 Objective perceptual multimedia video quality measurement in the presence of a full reference (PEVQ) 
  • J.341 Objective perceptual multimedia video quality measurement of HDTV for digital cable television in the presence of a full reference 

Since 1997 the principal body working on video quality has been the Video Quality Experts Group (VQEG). VQEG has produced a tutorial on comparative (“full-reference”) objective assessment of television quality, and is working on others.

In addition to audio and video, the ITU has looked into multimedia and data applications. Recommendation G.1011 is a reference guide to existing standards for QoE assessment, and identifies a taxonomy for such standards.

Recommendation G.1010 discusses applications (conversational voice, voice messaging, streaming audio, videophone, one-way video, web-browsing, bulk data transfer, email, e-commerce, interactive games, SMS, instant messaging, etc.) and gives performance targets for the delay, delay variation, and loss QoS parameters of each.

Recommendation G.1030 provides network planners with end-to-end (E-model-like) tools for applications over IP networks, with an appendix devoted to web browsing. The appendix presents empirical data on users’ perception of response times, and proposes a MOS measure. This work is complemented by G.1050, which describes an IP network model that can be used for evaluating the performance of IP streams based on QoS parameters (delay, delay variation, and loss). Recommendation G.1070 proposes an algorithm that estimates videophone quality for planners. Other documents include J.163 on QoS for real-time services over cable modems, and X.140 on QoS parameters for public data networks.

Outside the ITU, the Broadband Forum (BBF) has produced TR-126, which is an excellent tutorial on QoE as well as a useful set of guidelines for the relationship between QoE and QoS for broadband triple play applications. The document commences with a definition of QoE that is consistent with that of the ITU-T, namely a measure of end-to-end performance from the user’s perspective, in contrast with QoS as metrics of network performance. TR-126 provides a clear relationship between the two, so that given a set of QoS measurements, one could predict the QoE for a user, and conversely given a target QoE, one could deduce the required network performance. TR-126 discusses QoE “dimensions”: service set-up, operation, and tear-down; QoE “facets”: user effort, application responsiveness, information fidelity, security, and dependability/availability; and the service, application, and transport “layers”. While QoE is quintessentially end-to-end, TR-126 breaks down the contribution of various segments, such as access technologies (e.g., DSL and PON), ISPs, and application service providers. Specific guidelines are given for video (various kinds of entertainment video, video conferencing, surveillance video, streaming video, …), voice (wired, wireless, voice messaging, IVR), and best-effort Internet data (web browsing, email, file transfer, VPN, P2P, ecommerce, …).

The TeleManagement Forum (TMF), as could be expected, has documents discussing QoE from the Service Level Agreement (SLA) management perspective. TMF’s Wireless Services Measurement Handbook GB923 defines Key Quality Indicators (KQIs) and Key Performance Indicators (KPIs), similar to QoE scores and QoS parameters, respectively. KQIs experienced by end-users may in principle be determined from KPIs (although the mapping may be complex), while KPIs are derived from QoS parameters. The TMF has defined a set of KQIs including response time, service availability, speech/video quality, transaction rate, offered throughput, etc. An SLA consists of a set of thresholds for KQIs and KPIs; these are specified in the SLA Management Handbook GB917 and its Application Notes.

The Apdex Alliance is a group of collaborating companies that functions as a program under the auspices of the IEEE Industry Standards and Technology Organization (IEEE-ISTO). Its mission is to develop open standards that define standardized methods to report, benchmark, and track application performance. The Application Performance Index (Apdex) is a number between 0 and 1 that attempts to capture user satisfaction with an application. Zero signifies that no user would be satisfied, while 1 would mean that all users would be. More formally, users are divided into three categories: satisfied, tolerating, and frustrated; the Apdex is the ratio of the number of satisfied users plus half of the tolerating ones, to the total number of users.
                   Satisfied Count + (Tolerating Count / 2)
   Apdex = ------------------------------------------------------
            Satisfied Count + Tolerating Count + Frustrated Count
Apdex deconstructs application transactions into sessions (the “connect” time) consisting of processes (interactions accomplishing a goal) that are made up of tasks (individual interactions), and further into turns, protocols, and individual packets. The user is mainly aware of the task response time, since (s)he must wait for the task to complete before proceeding. For example, users may be satisfied if a web page completely loads within 2 seconds, and may tolerate the delay if it loads within 8 seconds. Above that, frustration sets in.
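
A minimal sketch of the Apdex calculation from a set of measured task response times is shown below. It uses the Apdex convention that a sample counts as satisfied below a target threshold T, tolerating between T and 4T, and frustrated beyond 4T (the 2-second/8-second web-page figures above correspond to T = 2 seconds); the response times in the example are invented:

def apdex(response_times, target_t):
    """Apdex score: (satisfied + tolerating/2) / total number of samples."""
    satisfied  = sum(1 for t in response_times if t <= target_t)
    tolerating = sum(1 for t in response_times if target_t < t <= 4 * target_t)
    total = len(response_times)
    return (satisfied + tolerating / 2.0) / total

# Hypothetical example: web-page load times in seconds, with T = 2 s,
# so anything above 8 s counts as frustrated.
times = [0.8, 1.5, 2.4, 3.0, 1.9, 9.5, 2.1, 0.6]
print(apdex(times, 2.0))   # (4 + 3/2) / 8 = 0.6875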

The problem with all of the above subjective and objective QoE measures is that they are service/application-specific. Since new applications are coming out every day, and furthermore different users may use completely different features of a single application, it is no longer feasible to study each application in depth. A new approach being studied is behavioral QoE estimation, where the user’s satisfaction is gauged based on his actions and reactions. An extreme example is the high measured correlation between a user being dissatisfied with a service level and his aborting the application (or at least waiting until the service level improves). Such behavioral QoE may be used to automatically map QoS to QoE for new applications, or may be used directly instead of traditional QoE measurement.

Y(J)S