Multi-Party Videoconferencing for the Web

Nikolaus Färber and Yaroslav Kryvyi
Multimedia Applications Department
Fraunhofer IIS
Erlangen, Germany

Vilmos Zsombori
Department of Computing
Goldsmiths, University of London
London, United Kingdom


ABSTRACT
There is increasing interest in videoconferencing for social networks, but web developers face two basic challenges: easy integration and scalability. The latter can be solved by a scalable architecture based on central media routers, which makes pragmatic use of the available resources while allowing low-complexity server components. Easy integration, on the other hand, requires a simple API with inherent support for multipoint conferencing and intelligent audio and video composition. In this paper we describe how the EU-funded research project Vconect has addressed these two challenges and compare the results with Google+ Hangouts and WebRTC. As a proof of concept, the Vconect platform has been integrated with SAPO Campus, a social network operated by the Portugal Telecom Group, with positive results from initial trials.


1 INTRODUCTION
Videoconferencing on the web has made huge progress in recent years and has become increasingly popular among users of social networks. The most prominent example is Hangouts in Google+ [2], which has evolved into a very attractive and feature-rich service. However, Hangouts is tightly coupled to Google+ and cannot easily be used in other social networks or small web services. Especially when tight integration with the web application is desired, developers cannot make use of Hangouts. As an alternative, WebRTC [3] promises easy integration into web services by making real-time communication part of the web browser and HTML5. However, WebRTC is focused on point-to-point communication between browsers and does not support multipoint conferencing directly. Although WebRTC can be extended to address multipoint conferencing, the resulting solutions require additional control logic to be implemented, which is typically beyond the scope of a normal web developer. Hence, it is still difficult to integrate videoconferencing easily into an arbitrary web application: Hangouts cannot be used outside Google+, and WebRTC does not support multi-party conferencing in a convenient way.

The EU-funded research project Vconect [1] has developed an alternative approach to videoconferencing which tries to fill this gap and make integration and implementation as easy as possible for web developers. This approach does not use WebRTC or Google Hangouts but is based on an independently developed browser plugin. The goal is to make videoconferencing as simple as integrating an image with the <img> tag in HTML. All the web developer has to care about is placing a <div> at an appropriate position on his web page; the Vconect platform takes care of the logic behind it and thus provides a good user experience. Vconect will consider the given size of the <div> and select an appropriate video composition and control strategy. This includes, e.g., detecting who is currently speaking and enlarging his screen area. This intelligent control and presentation of a multi-party conversation in a mediated environment is termed "Orchestration" in Vconect and is a key research contribution of this project. In this paper, however, we focus on the underlying architecture for media processing, i.e., media codecs and transmission. Special care has been taken to make effective use of network resources and to keep the complexity of server components low, such that scalability to a high number of users becomes possible at low operational cost. In addition, we provide an overview of the Application Programming Interface (API) for web developers, because we believe that easy integration is essential for adoption in web applications.


2 SCALABLE VIDEOCONFERENCING ARCHITECTURE
There are several approaches to enabling multi-party videoconferencing, which impose different requirements on the client, server, and network infrastructure; see Fig. 1. For example, clients can establish a fully meshed network in which each client maintains a peer-to-peer session with each of the other N participants. Although this approach does not require any server infrastructure (except for session setup), it puts a heavy load on the upload channel because each client has to send N copies of its stream. In addition, each client has to decode N streams and mix them before playout, which increases its complexity.

FIGURE 1: ARCHITECTURES FOR VIDEO CONFERENCING

The classic approach in telecommunication is that of a Multipoint Control Unit (MCU), which acts as a central bridge in a star topology. This architecture is convenient for the client, because it can treat the MCU as a normal peer and therefore establishes a normal point-to-point call. Hence, only a single encoding and decoding instance is needed, and the channel usage in the uplink and downlink remains unchanged. The drawback of this architecture is the complex MCU, which has to perform decoding, mixing, and re-encoding for each of the N+1 clients. This can become a problem for the operation of the platform as it becomes expensive to run, e.g. when each session requires its own dedicated hardware server. This is particularly true for web platforms with thousands of users who are not willing to pay extra money for the service (if anything at all).

A good compromise between these two architectures is a central media router, which balances the requirements for client and server in a pragmatic way. While maintaining a star topology, it replaces the bridge with a low-complexity media router. Technically, this media router is implemented on the application layer and runs on a server with high-bandwidth access. However, since the application running on this server mainly forwards packets, the term media router, or just router, is used. The key point is that the complexity of the router is minimal compared to the bridge. The drawback is a higher downlink bit rate and client complexity, because each client receives streams from N participants and performs decoding and playout similarly to the peer-to-peer approach. However, it turns out that a higher download bit rate is often acceptable in practice because of the asymmetric requirements of Internet traffic. For example, ADSL links often have a ten times higher downlink speed than uplink speed, and 3G/4G mobile networks are also much more limited in the uplink than in the downlink. Likewise, the additional client complexity of N-fold decoding is acceptable, because decoding is less complex than encoding and can be handled well by modern PCs. Hence, the central media router architecture makes pragmatic use of the available resources in the client CPU and downlink channel, while allowing the low-complexity server components needed for cost-effective operation with many users.
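The per-client trade-off between the three topologies can be summarized in a small sketch (illustrative only, not part of the Vconect code base), counting the streams each client sends, receives, and decodes for N remote participants:

```javascript
// Per-client stream counts for a conference with N remote participants,
// under the three architectures of Fig. 1 (mesh, MCU bridge, media router).
function streamLoad(architecture, n) {
  switch (architecture) {
    case "mesh":
      // Each client uploads its own stream N times and decodes N incoming streams.
      return { uplinkStreams: n, downlinkStreams: n, decodes: n };
    case "mcu":
      // The bridge mixes everything; the client sees one stream in each direction.
      return { uplinkStreams: 1, downlinkStreams: 1, decodes: 1 };
    case "router":
      // One upload; the router replicates, so the client receives N streams.
      return { uplinkStreams: 1, downlinkStreams: n, decodes: n };
    default:
      throw new Error("unknown architecture: " + architecture);
  }
}
```

For N = 4, `streamLoad("router", 4)` yields one uplink stream but four downlink streams and four decodes, which matches the asymmetric capacity of typical ADSL and 3G/4G links discussed above.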

Because of these advantages, the central media router architecture has been adopted successfully in industry, for example by Vidyo [10], which has to be credited as one of the pioneers. The basic idea, however, had already been proposed for audio conferencing as early as 2003 by Prasad et al. [14, 15]. The same approach can also be implemented with WebRTC when using special media routers [11, 16]. The Vconect project has also decided to follow this approach, as described in the next section.


3 VCONECT ARCHITECTURE AND COMPONENTS
The overall architecture and components of the Vconect platform are illustrated in Fig. 2 and are described in more detail below.

A. Client Side
The client side includes components for capturing, coding, and transmitting audio and video. These components are largely equivalent to those found in conventional videoconferencing systems. Encoding is based on H.264 video [12] and AAC-ELD audio [8]. IP encapsulation is based on RTP [13]. Video is encoded in multiple resolutions and bit rates, such that adaptation to network and client resources becomes possible. Which layers are transmitted into the network and forwarded to remote clients is decided by the Reasoning Layer. Audio processing is based on the Audio Communication Engine (ACE), a VoIP engine handling AAC-ELD coding, IP streaming, and echo control in a single module [9]. As required for the central media router architecture, the ACE can decode multiple audio streams and mix them before playout. This layer also incorporates the Video Composition Engine (VCE), which receives and composes multiple decoded video streams for presentation to the user.

FIGURE 2: VCONECT ARCHITECTURE AND COMPONENTS

Moving up to the Analysis Layer, the Vconect client includes components for the automated analysis of captured audio and video streams, from which cues can be generated as an input to the Reasoning Layer. As a result, the Vconect platform can e.g. detect which participant is currently talking and show his video enlarged.
All of the above components and functionality are integrated into a browser plugin, such that integration into web applications becomes as easy as possible.

B. Server Side
At the Content Layer, two main components provide scalable transmission of audio-visual streams: the Video Router (VR) and the Audio Router (AR).
The VR is an efficient packet switch and replicator that connects multiple source video streams to multiple client targets. It does not implement any media processing but only forwards RTP packets. However, it acts as an endpoint for RTCP [13] and can therefore monitor the network state on all links of the session. This network monitoring information is forwarded to the Reasoning Layer, which is responsible for network optimization. This optimization is, however, out of scope for this contribution, in which we focus on media coding and transport, i.e. the Content Layer. Because the VR and AR are required to have a public IP address, they can also assist in NAT traversal and avoid the need for a full STUN/TURN/ICE implementation. In essence, the VR and AR act as an inherent TURN relay and therefore simplify the overall architecture.
The AR behaves very similarly to the VR but selects M out of N audio streams to be forwarded to each client, where M≤N, typically M=2 or M=3. This follows the idea proposed in [14], in which it is observed that in a normal group conversation there are very seldom more than three participants speaking at the same time. This is a result of behavioral rules that humans follow implicitly in conversations. For example, if the current speaker is interrupted by another participant in the group, he will not continue to speak but give the other person a chance to take the floor (at least temporarily). Exploiting this behavior, the AR limits the audio bit rate on the downlink to M times the bit rate of a single speaker. In order to select the M most active speakers, a low-complexity energy estimation algorithm is used which does not require full decoding of the audio packets. Hence, the required complexity in the AR is much lower than that of normal audio decoding and can essentially be neglected.
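The selection step itself can be sketched as follows. This is an illustrative sketch, not the actual AR implementation; in particular, it assumes the per-stream energy estimates (which the real AR derives without fully decoding the AAC-ELD packets) are already given:

```javascript
// Select the M most active of N audio streams, given a recent energy
// estimate per stream. Only the selected streams are forwarded downlink,
// bounding the audio bit rate to M times that of a single speaker.
function selectActiveSpeakers(energies, m) {
  // energies: object mapping streamId -> energy estimate (higher = louder)
  return Object.keys(energies)
    .sort((a, b) => energies[b] - energies[a]) // loudest streams first
    .slice(0, m);                              // forward at most M of them
}
```

For example, with four participants and M=2, `selectActiveSpeakers({alice: 0.8, bob: 0.1, carol: 0.5, dave: 0.02}, 2)` returns `["alice", "carol"]`, i.e. the two currently loudest streams.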
The Communication Layer includes the Session Manager, which is the hub of the communication framework in Vconect: it enables messages to be transmitted between components and enables users to find each other and join a Vconect session. While each videoconference session has its own instance of a Communication Manager (not illustrated in Fig. 2), the Session Manager is the central entry point for all clients starting new sessions.
Another important aspect of a state-of-the-art videoconferencing platform is its ability to run the server components on a commercial cloud platform. This allows scaling the service in an efficient way. In the case of the Vconect platform, the AR has already been instantiated in Amazon EC2, and work on the remaining components is ongoing.


4 DEVELOPER API
A clear and simple Application Programming Interface (API) is crucial from the perspective of a web developer who wants to integrate group videoconferencing into his web site. Without loss of generality, we assume that this web site implements a Social Network (SN) with the required server infrastructure. The goal is to start a videoconference in a similar way to embedding an image using the HTML <img> tag. But instead of seeing a JPEG image on the web site, the user will see a live videoconference and be able to participate in the conversation.
To illustrate the overall architecture and message flow between the components we define four basic components as illustrated in Fig. 3.

1. SN Client
The Social Network Client is what the user sees and experiences when using the social network. It is the web application running in the user's browser and is typically implemented using HTML/CSS/JS. We assume that there is an existing web application that is to be extended with videoconferencing.

2. SN Server
The Social Network Server comprises all server components of the social network. This is where, e.g., all user data is stored (pictures, messages, profiles) and the HTML content is served from. While the user is logged into the social network, the SN Client talks to the SN Server through the SN-API, e.g. using HTTP and AJAX. This interface is out of scope for the Vconect platform but needs to provide some basic functionality, such as exchanging user identities.

3. VC Client
The Vconect Client is responsible for the actual videoconference and therefore transmits and receives live audio and video streams. It can be controlled from the SN Client using JavaScript through the Client API (VC-API). Multiple videos from multiple participants are composed and rendered into a rectangular area (<div>) of the SN Client. The VC Client is implemented as a browser plugin based on the FireBreath framework [5].

4. VC Server
The Vconect Server comprises all server components of the Vconect platform, including video routers, audio routers, and session management. All communication between the Vconect Client and Server is aggregated in the VC-API. This includes all media streams and control messages. These are out of scope for the web developer, i.e. he does not have to care about H.264, AAC, RTP, RTCP, etc.

FIGURE 3: COMPONENTS AND INTERFACES BETWEEN THE SOCIAL NETWORK (SN) AND THE VCONECT PLATFORM (VC).

There are two interfaces which a web developer has to consider when integrating a videoconference into a social network: the Client API (C) and the Server API (S). The social network platform has to use both to establish a videoconference.

1. Client API (C)
The Client API (C-API) is a JavaScript interface that controls the browser plugin. The SN Client loads the JavaScript library vConect.js from the Vconect Server, which provides a convenient API to the web developer. It can be seen as glue or wrapper code hiding some of the complexity of embedding the plugin and communicating with it.

2. Server API (S)
The two server systems communicate directly with each other through the Server API (S-API). For example, the SN Server will ask the VC Server to create a new session, i.e. a new videoconference. This interface uses REST/JSON as its communication protocol. Currently it is focused on session creation/deletion and can provide information on running sessions and their participants.
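As a hedged sketch of such an S-API exchange: the text only specifies that the interface uses REST/JSON and supports session creation/deletion, so the endpoint path, base URL, and response field name below are assumptions for illustration:

```javascript
// Build the REST request the SN Server would send to the VC Server to
// create a session. The "/sessions" path and the base URL are assumed;
// only the createSession semantics are taken from the text.
function buildCreateSessionRequest(vcServerBaseUrl) {
  return {
    method: "POST",
    url: vcServerBaseUrl + "/sessions",            // assumed endpoint
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({}),                      // no parameters required
  };
}

// The SN Server would then issue the request and read back the sessionID,
// e.g. (assumed response shape {"sessionID": "..."}):
//   const req = buildCreateSessionRequest("https://vc.example.com");
//   const res = await fetch(req.url, req);
//   const { sessionID } = await res.json();
```

The returned sessionID would then be distributed by the SN platform to the clients, as described below.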

The implementation of the SN-API is up to the social network platform and is not of concern to the Vconect platform. We simply assume that the relevant information can be exchanged, e.g. using AJAX. Similarly, the details of the VC-API are hidden from the web developer. He does not have to care how media is routed and which control messages are transmitted in the VC back end. Table 1 lists all methods on the Client-API and Server-API with a short description.

TABLE 1: OVERVIEW OF API

The starting point on the server side is creating a session through the createSession request; the VC platform allocates all required resources in the back end and returns a sessionID, which is comparable to the number/PIN of a conventional phone conference. It is the responsibility of the SN platform to distribute the sessionID to all clients who want to join the session.
On the client side, the starting point is the SN Client calling the JavaScript function vConect(divID, sessionID, userID). The userID is a unique identifier for the user in the social network and is provided by the SN platform. Joining a videoconference is then triggered through a call to vConect.startClient(). If the createSession, vConect(), and vConect.startClient() calls are successful, this is all the web developer needs to do: at this point there should be live audio and video flowing between the participants, who can immediately start their conversation.
The corresponding methods to leave a session and delete a session are vConect.stopClient() on the Client-API and deleteSession on the Server-API. Besides those and basic microphone control for muting (vConect.setMicrophoneOff(), vConect.setMicrophoneOn()), the only remaining API call to consider is the setting of view-modes.
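Putting these calls together, a minimal embedding could look like the following sketch. The div id, the library URL, and the way sessionID and userID reach the page are placeholders; only the API calls themselves are taken from the description above:

```html
<!-- Hypothetical embedding sketch; the div id, the vConect.js URL, and the
     sessionID/userID values are placeholders provided by the SN platform. -->
<div id="vc-area" style="width: 640px; height: 480px;"></div>
<script src="https://vc.example.com/vConect.js"></script>
<script>
  // sessionID was created by the SN Server via createSession and
  // distributed to this client; userID comes from the social network.
  vConect("vc-area", sessionID, userID);
  vConect.startClient();            // join: live audio/video starts flowing

  // Later, e.g. from a mute button:
  //   vConect.setMicrophoneOff();
  // and to leave the conference:
  //   vConect.stopClient();
</script>
```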

A view-mode is a high-level layout style which describes the basic video composition and orchestration approach. As illustrated in Fig. 4, there are three view-modes which can be selected by the web developer and/or user. In view-mode Tile, all participants are displayed simultaneously in tiles of equal size. No orchestration is active, as all participants are always visible. In view-mode Clean-Cut, only a single participant is shown on the screen at any point in time, and his video is scaled as large as the given <div> allows. The orchestration makes sure that the person who is currently talking is shown. The view-mode Standard is a compromise between the two options above and very similar to Google+ Hangouts. It shows the person who is currently talking in a larger video while still displaying the other participants in small videos ("thumbnails"). The transition during turn-taking is animated. The corresponding methods are listed in Tab. 1.
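The choice between view-modes could, for example, be driven by the size of the embedding <div> and the number of participants. The following heuristic is purely illustrative and is not part of the specified API; the actual view-mode methods are those listed in Tab. 1, and the policy is up to the web developer and/or user:

```javascript
// Illustrative heuristic only: suggest a view-mode from the width of the
// embedding <div> (in pixels) and the number of participants.
function suggestViewMode(divWidth, participants) {
  if (divWidth < 320) return "clean-cut"; // too small for useful thumbnails
  if (participants <= 4) return "tile";   // small groups: show everyone
  return "standard";                      // speaker large, others as thumbnails
}
```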

FIGURE 4: ILLUSTRATION OF VIEW-MODES.

The described API is an interim specification and subject to change. However, the high-level API described above is expected to remain relatively stable. A lot of effort is still needed to make the underlying technology work robustly under difficult network conditions, but this development happens "under the hood" and will not affect the API.


5 DEPLOYMENT IN SAPO CAMPUS

The Vconect platform has been successfully integrated into SAPO Campus, a social network operated by SAPO, the Internet service provider of the Portugal Telecom Group [6]. Currently focusing on schools and universities as closed user communities, SAPO Campus brings together all of SAPO's core social services in a single out-of-the-box web application. On top of the usual networking features, such as activity feeds of status updates, comments, groups, and private messages, it includes SAPO's well-known services such as blogs, photo, video, and file sharing, all fully owned and managed by the community under its own branding. SAPO Campus is very different in nature from larger, better-known social networks, because the people in this network are members of the same institution, which provides a strong sense of community and responsibility.

SAPO Campus is currently in the product deployment stage, with potential schools being contacted and given workshops on how to use it. It is already being used by more than 30 schools in a pilot program, and the plan is to have it deployed in a large number of schools by the end of 2014.

For the integration of Vconect-based videoconferencing, SAPO Campus has introduced the concept of Rooms as persistent objects, like Math or Geography classrooms at school, as shown in Fig. 5. From Vconect's perspective, a Room is just another name for a session created by the SN. Users can join a Room after having signed up for the class. After having been approved as a class member by an administrator, users can enter the Room and participate in the video call. In this way, Campus does its own user management and provides the VC platform with only a client ID. Each Room displays a list of members as a drop-down menu, also highlighting their online or offline status. All this functionality can be implemented by utilizing the simple API described above.

FIGURE 5: SCREEN SHOT OF SAPO CAMPUS ROOMS.

It is worth mentioning that the integration of Vconect went faster than SAPO's development team had anticipated, which allowed initial trials already in 2013. Fig. 6 displays a historical screen shot of the first successful and fully functional videoconference in SAPO Campus.
After the integration, Vconect conducted successful trials in SAPO Campus involving 25 users. During each of the five sessions, five participants had to solve three different tasks in three different rooms, while there were always at least two of them in the same room. Besides checking overall stability and doing basic beta testing, the main goal was to study the interplay between social network and videoconference usage.

Some results of the trials were surprising because they contradicted our assumption that the Standard view-mode would be preferred by most participants. Instead, most participants preferred the Tile view-mode, which led to the conclusion that in small groups there is less need for orchestration. The general feedback was very positive, resulting in comments such as
  I loved the discussions and the interaction. We talked on day without the Internet?
and
  Very interesting – the different chat rooms. Very nice to go in and out.
Many participants, who were all already users of SAPO Campus at their university, asked
  When can we use it in our campus?
which encourages us in the continuation of our work.
 
Vconect has scheduled bigger trials for summer 2014. More SAPO Campus users will be offered the use of Vconect for an extended period of time, i.e. several weeks. Hundreds of simultaneous users will generate significant load on the server infrastructure, providing us with data on the efficiency of our multimedia transport and processing and allowing us to evaluate the VC Client performance on different hardware setups. The crucial task will be to evaluate our ability to adapt to the different and changing network conditions of each user while providing good quality of service and experience.

FIGURE 6: SCREEN SHOT OF SAPO CAMPUS WITH INTEGRATED VCONECT CLIENT (IN RED <DIV>).


6 CONCLUSIONS
Despite the huge progress made in establishing videoconferencing over the Internet in recent years, there are still remaining challenges to be addressed. Existing solutions like Google+ Hangouts are closely bound to their own web application, while open solutions like WebRTC do not inherently support group conferencing.
Two challenges have to be addressed before future solutions become widely available on the web. Firstly, the architecture has to support scalability, i.e. the ability to serve thousands of users in a cost-effective way. Secondly, the API has to be simple and easy to use for web developers.
Considering scalability, the central media router architecture seems most promising because it makes pragmatic use of the available resources in the client CPU and downlink channel, while allowing the low-complexity server components needed for cost-effective operation with many users. Hence, we predict that this architecture will become predominant for web conferencing.
Considering easy integration, we consider the low-level API of WebRTC too complex for the normal web developer, because it is focused on point-to-point communication and needs significant extra logic for managing and controlling group conferencing. Topics like automated floor control ("Orchestration") and video composition (view-mode selection) have to be included in a middleware layer between the web application and the basic media transport. Such videoconferencing middleware is already being developed as extensions to WebRTC; see TokBox, vLine, and Licode.
The Vconect project has developed an alternative videoconferencing platform which addresses these two challenges. It is also based on the central router architecture and offers an easy API for web developers. Easy integration has been verified by the smooth integration into SAPO Campus, and user trials have shown a good user experience. The Vconect platform has proven particularly useful in studying user experience and preferences, as we have full control over Orchestration and video composition. In summer 2014 Vconect will hold a final trial in which hundreds of users will use SAPO Campus with Vconect functionality for several weeks. The evaluation of the trial data will hopefully prove that the integration of videoconferencing with social networks provides added value to the users and that the Vconect platform can provide the scalability required to offer such a service in a cost-effective way.
We conclude this paper with a self-critical look into the future of multi-party videoconferencing for the web and the role of Vconect in this future. Because of the huge interest in and industry support of WebRTC, it would be naïve to ignore this technology trend and rely solely on the Vconect development, a comparatively small effort with a limited lifetime. In fact, the WebRTC middleware extensions towards multi-party conferencing which are currently being developed (TokBox, vLine, Licode, ...) make it a very attractive solution for the content layer (i.e. media processing and transmission) and can replace similar functionality as provided by the Vconect browser plugin; see Fig. 7. The Vconect project did not use these middleware extensions because they were not available at the start of the project (2011) and because the browser plugin allows the highest level of flexibility in implementing and studying intelligent composition and floor control, i.e. Orchestration. With a fresh look and current knowledge of WebRTC developments, the choice might now lean towards a WebRTC-based approach, especially when considering a commercial launch of a social network instead of an experimental platform for a research project.
However, Vconect can still add value in the Reasoning Layer with Orchestration, which has been the focus of our work and the main area of innovation. Therefore, as illustrated in Fig. 7, the Vconect API as defined in vConect.js may still be used as a higher-level API for web developers. Even though the content layer may be exchanged, the Orchestration logic and the simple API can be reused.

FIGURE 7: POSSIBLE EVOLUTION OF VCONECT TOWARDS WEBRTC.

Acknowledgment
The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. ICT-2011-287760.
In addition, we would like to thank the complete technical team that contributed to the work described in this paper, and in particular the SAPO team, including Pedro Torres, Jorge Braz, and João Abreu.


REFERENCES

[1]     Home page of Vconect: http://www.vconect-project.eu/

[2]     Home page of Google Hangouts http://www.google.com/+/learnmore/hangouts/

[3]     Home page of WebRTC http://www.webrtc.org/

[4]     Home page of Apache activeMQ project http://activemq.apache.org/

[5]     Home page of the FireBreath project http://www.firebreath.org

[6]     Home page of SAPO Campus http://campus.sapo.pt/.

[7]     Home page of Full-HD Voice project http://www.full-hd-voice.com/

[8]     Information on AAC-ELD http://www.iis.fraunhofer.de/en/bf/amm/produkte/audiocodec/audiocodecs/aaceld.html

[9]     White paper on the Audio Communication Engine (ACE), available at http://www.iis.fraunhofer.de/en/bf/amm/download/produktbr.html

[10]  Home page of Vidyo http://www.vidyo.com/

[11]  Home page of Licode http://lynckia.com/licode/

[12]  ITU page for H.264 https://www.itu.int/rec/T-REC-H.264

[13]  RTP: A Transport Protocol for Real-Time Applications, RFC 1889: http://tools.ietf.org/html/rfc1889

[14]  R.V. Prasad, H.S. Jamadagni, and H.N. Shankar, "Number of Floors for a Voice-Only Conference on Packet Networks – A Conjecture," IEE Proceedings – Communications, Vol. 151, No. 3, pp. 287–291, 2004.

[15]  R.V. Prasad, R. Hurni, and H. S. Jamadagni, “A Scalable Distributed VoIP Conference Using SIP”, in Proc. 8th IEEE International Symposium on Computers and Communications (ISCC’03), 2003.

[16]  Mantis for OpenTok, see http://www.tokbox.com/blog/mantis-next-generation-cloud-technology-for-webrtc/




Nikolaus Färber received his Doctoral degree in 2000 as a member of the Image Communication Group, University of Erlangen-Nuremberg, Germany. He has published numerous conference and journal papers in the area of robust video transmission and has contributed successfully to international standards bodies such as MPEG, ITU, 3GPP, and DASH-IF. After a Post-Doc at Stanford University in 2001, he joined Ericsson Eurolab, Nuremberg, Germany, as a member of the speech processing group. Since 2003 he has been with Fraunhofer IIS, Erlangen, Germany, where he heads the Multimedia Applications department.










Yaroslav Kryvyi received his Master's degree in computer science in 2011 from the Institute of Physics and Technology of the National Technical University of Ukraine "Kyiv Polytechnic Institute". He has acquired extensive programming experience in the area of audio processing while working for companies such as Bang & Olufsen A/S, Denmark, and Avid Ltd, USA. Since 2012 he has worked at Fraunhofer IIS, Erlangen, Germany, where he participates in the Vconect EU project.











Vilmos Zsombori is a Senior Researcher in the Narrative Interactive Media research group of the Department of Computing at Goldsmiths, University of London. He leads the development of the architecture and technical integration in the EU FP7 project Vconect and has made substantial contributions to the implementation of smart video communication in the EU FP7 project TA2 and interactive video narratives in the EU FP6 project NM2.
