Nikolaus Färber and Yaroslav Kryvyi
Multimedia Applications Department
There is increasing interest in videoconferencing for social networks, but web developers face two basic challenges: easy integration and scalability. The latter can be addressed by a scalable architecture based on central media routers, which makes pragmatic use of the available resources while allowing low-complexity server components. Easy integration, on the other hand, requires a simple API with inherent support for multipoint conferencing and intelligent audio and video composition. In this paper we describe how the EU-funded research project Vconect has addressed these two challenges and compare the results with Google+ Hangouts and WebRTC. As a proof of concept, the Vconect platform has been integrated with SAPO Campus, a social network operated by Portugal Telecom Group, with positive results from initial trials.
Videoconferencing on the web has made huge progress in recent years and has become increasingly popular among users of social networks. The most prominent example is Hangouts in Google+, which has evolved into a very attractive and feature-rich service. However, Hangouts is tightly coupled to Google+ and cannot easily be used in other social networks or small web services. Developers cannot make use of Hangouts, especially when a tight integration with the web application is desired. As an alternative to Hangouts, WebRTC promises easy integration into web services by making videoconferencing a part of the web browser and HTML5. However, WebRTC is focused on point-to-point communication between browsers and does not support multipoint conferencing directly. Although WebRTC can be extended to address multipoint conferencing, the resulting solutions require additional control logic, which is typically beyond the scope of a normal web developer. Hence, it is still difficult to integrate videoconferencing into an arbitrary web application: Hangouts cannot be used outside Google+, and WebRTC does not support multi-party conferencing in a convenient way.
The EU-funded research project Vconect has developed an alternative approach to videoconferencing, which tries to fill this gap and make integration and implementation as easy as possible for web developers. This approach does not use WebRTC or Google Hangouts but is based on an independently developed browser plugin. The goal is to make videoconferencing as simple as integrating an image with the <img> tag in HTML. All the web developer has to care about is placing a <div> at an appropriate position on the web page; the Vconect platform takes care of the logic behind it and thus provides a good user experience. Vconect considers the given size of the <div> and selects an appropriate video composition and control strategy. This includes, for example, detecting who is currently speaking and enlarging that participant's screen area. This intelligent control and presentation of a multi-party conversation in a mediated environment is termed “Orchestration” in Vconect and is a key research contribution of the project.
In this paper, however, we focus on the underlying architecture for media processing, i.e. media codecs and transmission. Special care has been taken to make effective use of network resources and to keep the complexity of server components low, such that scalability to a high number of users becomes possible at low operational cost. In addition, we provide an overview of the Application Programming Interface (API) for web developers because we believe that easy integration is essential to adoption in web applications.
2 SCALABLE VIDEOCONFERENCING ARCHITECTURE
There are several approaches to multi-party videoconferencing, which impose different requirements on the client, server and network infrastructure, see Fig. 1. For example, clients can establish a fully meshed network in which each client sets up a peer-to-peer session with each of the N other participants. Although this approach does not require any server infrastructure (except for session setup), it puts a heavy load on the upload channel because each client has to send N copies of its stream. In addition, each client has to decode N streams and mix them before playout, which increases its complexity.
The classic approach in telecommunications is that of a Multipoint Control Unit (MCU), which acts as a central bridge in a star topology. This architecture is convenient for the client, because it can treat the MCU as a normal client and establish a normal point-to-point call. Therefore, only a single encoding and decoding instance is needed, and the channel usage in the uplink and downlink remains unchanged. The drawback of this architecture is the complex MCU, which has to do decoding, mixing, and re-encoding for each of the N+1 clients. This can become a problem for the operation of the platform as it becomes expensive to run, e.g. when each session requires its own dedicated hardware server. This is particularly true for web platforms with thousands of users who are not willing to pay much, if anything at all, for the service.
A good compromise between these two architectures is a central media router, which balances the requirements for the client and server in a pragmatic way. While maintaining a star topology, it replaces the bridge with a low-complexity media router. Technically, this media router is implemented on the application layer and runs on a server with high-bandwidth access. However, as the application running on this server is mainly forwarding packets, the term media router, or just router, is used. The key point is that the complexity of the router is minimal compared to the bridge. The drawback is a higher downlink bit rate and client complexity, because each client receives streams from N participants and performs decoding and playout as in the peer-to-peer approach. However, it turns out that a higher downlink bit rate is often acceptable in practice because Internet access links are typically asymmetric. For example, ADSL links often have a ten times higher downlink speed than uplink speed, and 3G/4G mobile networks are also much more limited in the uplink than in the downlink. Besides, the additional client complexity of decoding N streams is also acceptable, because decoding is less complex than encoding and can be handled well by modern PCs. Hence, the central media router architecture makes pragmatic use of the available resources in the client CPU and downlink channel, while allowing low-complexity server components as needed for cost-effective operation for many users.
Because of these advantages, the central media router architecture has been adopted successfully in the industry, for example by Vidyo, which has to be credited as one of the pioneers. However, the basic idea was already proposed for audio conferencing in 2003 by Prasad [14, 15]. The same approach can also be implemented in WebRTC when using special media routers [11, 16]. The Vconect project has also decided to follow this approach, as described in the next section.
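The trade-offs between the three architectures can be made concrete by counting streams. The following Python sketch is only an illustration of the comparison in this section: the `Load` record, the function name `architecture_load`, and the choice of counting units (streams rather than bit rates) are assumptions made for this example and are not part of the Vconect platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Load:
    """Load of one conference, counted in streams (illustrative units)."""
    client_uplink: int      # streams each client sends
    client_downlink: int    # streams each client receives
    client_decodes: int     # streams each client has to decode
    server_transcodes: int  # decode/mix/re-encode chains run by the server
    server_forwards: int    # stream copies forwarded without media processing


def architecture_load(kind: str, n: int) -> Load:
    """Return the load for a conference in which every client has n remote
    participants, i.e. n + 1 clients in total."""
    if n < 1:
        raise ValueError("need at least one remote participant")
    clients = n + 1
    if kind == "mesh":
        # Every client sends its stream n times and decodes n streams.
        return Load(n, n, n, 0, 0)
    if kind == "mcu":
        # Point-to-point call per client; the bridge transcodes per client.
        return Load(1, 1, 1, clients, 0)
    if kind == "router":
        # One uplink stream per client; the router copies it to n others.
        return Load(1, n, n, 0, clients * n)
    raise ValueError(f"unknown architecture: {kind!r}")


if __name__ == "__main__":
    for kind in ("mesh", "mcu", "router"):
        print(kind, architecture_load(kind, 4))
```

For four remote participants the sketch shows the pattern described above: the mesh loads the uplink, the MCU loads the server with transcoding, and the router shifts the load to the downlink and to client-side decoding while the server only forwards.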
3 VCONECT ARCHITECTURE AND COMPONENTS
The overall architecture and components of the Vconect platform are illustrated in Fig. 2 and are described in more detail below.
A. Client Side
The client side includes components for capturing, coding and transmitting audio and video. These components are largely equivalent to those found in conventional videoconferencing systems. Encoding is based on H.264 video and AAC-ELD audio. IP encapsulation is based on RTP. Video is encoded in multiple resolutions and bit rates, such that adaptation to the network and client resources becomes possible. Which layers are transmitted into the network and forwarded to remote clients is decided by the Reasoning Layer. Audio processing is based on the Audio Communication Engine (ACE), which is a VoIP engine handling AAC-ELD coding, IP streaming, and echo control in a single module. As required for the central media router architecture, the ACE can decode multiple audio streams and mix them before playout. This layer also incorporates the Video Composition Engine (VCE), which receives and composes multiple decoded video streams for presentation to the user.
Moving up to the Analysis Layer, the Vconect client includes components for the automated analysis of captured audio and video streams, from which cues can be generated as an input to the Reasoning Layer. As a result, the Vconect platform can e.g. detect which participant is currently talking and show his video enlarged.
All of the above components and functionality are integrated into a browser plugin, such that integration into web applications becomes as easy as possible.
B. Server Side
At the Content Layer, two main components provide scalable transmission of audio-visual streams: the Video Router (VR) and the Audio Router (AR).
The VR is an efficient packet switch and replicator which connects multiple source video streams to multiple client targets. It does not implement any media processing but only forwards RTP packets. However, it acts as an end-point for RTCP and can therefore monitor the network state on all links of the session. This network monitoring information is forwarded to the Reasoning Layer, which is responsible for network optimization. This optimization is, however, out of scope for this contribution, in which we focus on media coding and transport, i.e. the Content Layer. Because the VR and AR are required to have a public IP address, they can also assist in NAT traversal and avoid the need for a full STUN/TURN/ICE implementation. In essence, the VR and AR act as an inherent TURN relay and therefore simplify the overall architecture.
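The forwarding behaviour of such a router can be sketched in a few lines of Python. This is a simplified, in-memory illustration under stated assumptions: the class `PacketRouter`, its methods, and the use of lists as outbound queues are hypothetical and do not reflect the actual VR implementation, which operates on network sockets and also terminates RTCP.

```python
class PacketRouter:
    """Replicates opaque packets from one sender to all other clients.

    Packets are never parsed or transcoded; they are only copied, which is
    what keeps the server-side complexity low.
    """

    def __init__(self) -> None:
        # client id -> outbound queue (a plain list stands in for a socket)
        self._queues: dict[str, list[bytes]] = {}

    def join(self, client_id: str) -> None:
        self._queues[client_id] = []

    def leave(self, client_id: str) -> None:
        self._queues.pop(client_id, None)

    def forward(self, sender_id: str, packet: bytes) -> int:
        """Copy the packet to every client except the sender.

        Returns the number of copies that were queued.
        """
        if sender_id not in self._queues:
            raise KeyError(f"unknown sender: {sender_id}")
        copies = 0
        for client_id, queue in self._queues.items():
            if client_id != sender_id:
                queue.append(packet)
                copies += 1
        return copies

    def outbound(self, client_id: str) -> list[bytes]:
        """Return a snapshot of the packets queued for one client."""
        return list(self._queues[client_id])
```

In a session with N+1 clients, each call to `forward` queues N copies, which matches the downlink load discussed for the central media router architecture.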
The AR behaves very similarly to the VR but selects M out of N audio streams to be forwarded to each client, where M ≤ N and M is typically set to 2 or 3. This follows an earlier proposal, which observes that in a normal group conversation there are very seldom more than 3 participants speaking at the same time. This is a result of behavioral rules which humans follow implicitly in conversations. For example, if the current speaker is interrupted by another participant in the group, the speaker will usually stop and give the other person a chance to take the floor, at least temporarily. Exploiting this behavior, the AR limits the audio bit rate on the downlink to M times the bit rate of a single speaker. In order to select the M most active speakers, a low-complexity energy estimation algorithm is used which does not require full decoding of the audio packets. Hence, the required complexity in the AR is much lower than that of normal audio decoding and can basically be neglected.
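The selection step can be illustrated with a short Python sketch. It assumes that an energy estimate per stream is already available (how the AR derives it from undecoded packets is not shown here), and it assumes that a client's own stream is excluded before selection. The function names and the data layout are hypothetical.

```python
import heapq


def select_active_speakers(energies: dict[str, float], m: int = 3) -> list[str]:
    """Return the ids of the (at most) m streams with the highest energy
    estimate, loudest first."""
    if m < 0:
        raise ValueError("m must be non-negative")
    return heapq.nlargest(m, energies, key=energies.get)


def streams_for_client(client_id: str,
                       energies: dict[str, float],
                       m: int = 3) -> list[str]:
    """Select the streams forwarded to one client.

    Assumption of this sketch: a client never receives its own audio.
    """
    others = {sid: e for sid, e in energies.items() if sid != client_id}
    return select_active_speakers(others, m)


if __name__ == "__main__":
    estimates = {"alice": 0.9, "bob": 0.1, "carol": 0.5, "dave": 0.7}
    print(streams_for_client("bob", estimates, m=2))
```

Whatever the number of participants, each client receives at most M streams, so the downlink audio bit rate stays bounded by M times the bit rate of a single speaker.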
The Communication Layer includes the Session Manager, which is the hub of the communication framework in Vconect, enabling messages to be transmitted between components, and enabling users to find each other and join a Vconect session. While each videoconference session has its own instance of a Communication Manager (not illustrated in Fig. 2), the Session Manager is the central entry point for all clients starting new sessions.
Another important aspect of a state-of-the-art videoconferencing platform is its ability to run the server components on a commercial cloud platform. This allows the service to be scaled efficiently. In the case of the Vconect platform, the AR has already been instantiated on Amazon EC2, and work on the remaining components is ongoing.
4 DEVELOPER API
A clear and simple Application Programming Interface (API) is crucial from the perspective of a web developer who wants to integrate group videoconferencing into his web site. Without loss of generality, we assume that this web site implements a Social Network (SN) with the required server infrastructure. The goal is to start a videoconference in a similar way as embedding an image using the HTML <img> tag. But instead of seeing a JPG image on the web site, the user will see a live videoconference and be able to participate in the conversation.
To illustrate the overall architecture and message flow between the components we define four basic components as illustrated in Fig. 3.
1. SN Client
The Social Network Client is what the user sees and experiences when being on the social network. It is the web application running in the browser of the user and is typically implemented using HTML/CSS/JS. We assume that there is an existing web application that shall be extended with videoconferencing.
2. SN Server
The Social Network Server comprises all server components of the social network. This is where e.g. all user data is stored (pictures, messages, profiles) and the HTML content is served from. While the user is logged in the social network, the SN Client is talking to the SN Server through the SN-API, e.g. using HTTP and AJAX. This interface is out of scope for the Vconect platform but needs to provide some basic functionality, such as exchanging user identities.
3. VC Client
4. VC Server
The Vconect Server comprises all server components of the Vconect platform, including video routers, audio routers, and session management. All communication between the Vconect Client and Server is aggregated in the VC-API. This includes all media streams and control messages. Those are out of scope for the web developer, who does not have to care about H.264, AAC, RTP, RTCP, etc.
There are two interfaces which a web developer has to consider for integrating a videoconference into a social network, i.e. the Client API (C) and Server API (S). The social network platform will have to use both to establish a videoconference.
1. Client API (C)
2. Server API (S)
The two server systems communicate directly to each other through the Server API (S-API). For example, the SN Server will ask the VC Server to create a new session, i.e. a new video conference. This interface is using REST/JSON as a communication protocol. Currently it is focused on session creation/deletion and can provide information on running sessions and their participants.
The implementation of the SN-API is up to the social network platform and is not of concern to the Vconect platform. We simply assume that the relevant information can be exchanged, e.g. using AJAX. Similarly, the details of the VC-API are hidden from the web developer. He does not have to care how media is routed and which control messages are transmitted in the VC back end. Table 1 lists all methods on the Client-API and Server-API with a short description.
The starting point on the server side is creating a session through the createSession request; the VC platform allocates all required resources in the back end and returns a sessionID, which is comparable to the number/PIN of a conventional phone conference. It is the responsibility of the SN platform to distribute the sessionID to all clients who want to join this session.
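The session lifecycle on the Server API can be sketched with an in-memory mock. Only the request names createSession and deleteSession and the notion of a sessionID come from the description in this section; the class `VcServerMock`, the `join` and `participants` helpers, and the use of a random UUID as sessionID are assumptions made for illustration, not the actual REST/JSON interface.

```python
import uuid


class VcServerMock:
    """In-memory stand-in for the session handling of the VC Server."""

    def __init__(self) -> None:
        # sessionID -> ids of the clients currently in the session
        self._sessions: dict[str, set[str]] = {}

    def create_session(self) -> str:
        """Mimics createSession: allocate a session and return its id."""
        session_id = uuid.uuid4().hex  # hypothetical id format
        self._sessions[session_id] = set()
        return session_id

    def join(self, session_id: str, client_id: str) -> None:
        """Hypothetical helper: a client enters a session using the id
        that the social network distributed to it."""
        self._sessions[session_id].add(client_id)

    def participants(self, session_id: str) -> list[str]:
        """Report the participants of a running session."""
        return sorted(self._sessions[session_id])

    def delete_session(self, session_id: str) -> bool:
        """Mimics deleteSession: returns True if the session existed."""
        return self._sessions.pop(session_id, None) is not None


if __name__ == "__main__":
    vc = VcServerMock()
    sid = vc.create_session()      # requested by the SN Server
    for user in ("ana", "rui"):    # the SN distributes sid to its clients
        vc.join(sid, user)
    print(vc.participants(sid))
    vc.delete_session(sid)
```

The sketch reflects the division of responsibilities: the VC side only knows sessions and client ids, while user management and distribution of the sessionID remain with the social network.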
The corresponding methods to leave a session and delete a session are vConect.stopClient() on the Client-API and deleteSession on the Server-API. Besides those and basic microphone control for muting (vConect.setMicrophoneOff(), vConect.setMicrophoneOn()), the only remaining API call to consider is the setting of view-modes.
A view-mode is a high-level layout style, which describes the basic video composition and orchestration approach. As illustrated in Fig. 4, there are three view-modes which can be selected by the web developer and/or user. In view-mode Tile, all participants are displayed simultaneously in tiles of equal size. There is no orchestration active as all participants are always visible. In view-mode Clean-Cut, only a single participant is shown on the screen at any point in time, and the video is scaled as large as the given <div> allows. The orchestration makes sure that the person who is currently talking is shown. The view-mode Standard is a compromise between the above two options and very similar to Google+ Hangouts. It shows the person who is currently talking in a larger video while still displaying the other participants in small videos (“thumbnails”). The transition during turn taking is animated. The corresponding methods are listed in Tab. 1.
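How the three view-modes translate into screen areas can be illustrated with a small Python sketch that computes rectangles inside a container of a given size. The grid rule for Tile, the 20% thumbnail strip for Standard, and all names (`Rect`, `layout`) are assumptions chosen for this example; the actual composition and the animated transitions of the Vconect platform are not modelled.

```python
import math
from typing import NamedTuple


class Rect(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def layout(mode: str, participants: list[str], speaker: str,
           width: int, height: int) -> dict[str, Rect]:
    """Map each visible participant to a rectangle inside the container."""
    if speaker not in participants:
        raise ValueError("speaker must be one of the participants")
    if mode == "tile":
        # Equal-sized tiles on a near-square grid; no orchestration needed.
        cols = math.ceil(math.sqrt(len(participants)))
        rows = math.ceil(len(participants) / cols)
        tw, th = width // cols, height // rows
        return {p: Rect((i % cols) * tw, (i // cols) * th, tw, th)
                for i, p in enumerate(participants)}
    if mode == "clean-cut":
        # Only the current speaker is shown, as large as the container.
        return {speaker: Rect(0, 0, width, height)}
    if mode == "standard":
        # Speaker on top, everyone else in a thumbnail strip below.
        others = [p for p in participants if p != speaker]
        if not others:
            return {speaker: Rect(0, 0, width, height)}
        thumb_h = height // 5  # assumed 20% strip
        thumb_w = width // len(others)
        result = {speaker: Rect(0, 0, width, height - thumb_h)}
        for i, p in enumerate(others):
            result[p] = Rect(i * thumb_w, height - thumb_h, thumb_w, thumb_h)
        return result
    raise ValueError(f"unknown view-mode: {mode!r}")
```

When the active speaker changes, calling `layout` again with the new speaker yields the target rectangles; in Tile mode the result is independent of the speaker, which mirrors the statement that no orchestration is active there.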
The described API is an interim specification and subject to change. However, the high level API as described above is expected to be relatively stable and mature. A lot of effort is needed to make the underlying technology work robustly in difficult network conditions. But this development is happening “under the hood” and will not affect the API.
5 DEPLOYMENT IN SAPO CAMPUS
The Vconect platform has been successfully integrated into SAPO Campus, a social network operated by SAPO, the Internet service provider of Portugal Telecom Group. Currently focusing on schools and universities as closed user communities, SAPO Campus brings together all of SAPO’s core social services in a single out-of-the-box web application. On top of the usual networking features, such as activity feeds of status updates, comments, groups, and private messages, it includes SAPO’s well-known services like blogs and photo, video and file sharing. All of these are fully owned and managed by the community, with its own branding. SAPO Campus is very different in nature from larger, better-known social networks, because people in this network are members of the same institution, which provides a strong sense of community and responsibility.
SAPO Campus is currently in product deployment stage with potential schools being contacted and given workshops on how to use it. It is already being used by 30+ schools in a pilot program and plans are to have it deployed in a large number of schools by the end of 2014.
For the integration of Vconect-based videoconferencing, SAPO Campus has introduced the concept of Rooms as permanent objects, like Math or Geography classrooms at school, as shown in Fig. 5. From Vconect’s perspective, a Room is just another name for a session created by the SN. Users can sign up for a class and, after having been approved as a class member by an administrator, enter the Room and participate in the video call. This way SAPO Campus does its own user management and provides only a client ID to VC. Each Room displays a list of members as a drop-down menu, which also highlights their online or offline status. All this functionality can be implemented using the simple API described above.
It is worth mentioning that the integration of Vconect went faster than SAPO’s development team anticipated, which allowed initial trials already in 2013. Fig. 6 displays a historical screen shot of the first successful and fully functional videoconference in SAPO Campus.
After the integration, Vconect had successful trials in SAPO Campus, involving 25 users. During each of the five sessions, five participants had to solve three different tasks in three different rooms, while there were always at least two of them in the same room. Besides checking the overall stability and doing basic beta testing, the main goal was to study the interplay between social network and videoconference usage.
Some results of the trials were surprising because they contradicted our assumption that the Standard view-mode would be preferred by most participants. Instead, most participants preferred the Tile view-mode, which led to the conclusion that small groups of participants have less need for orchestration. The general feedback was very positive, resulting in comments such as
I loved the discussions and the interaction. We talked on day without the Internet?
Very interesting – the different chat rooms. Very nice to go in and out.
Many participants, who were all already users of SAPO Campus at their university, asked
When can we use it in our campus?
which encourages us in the continuation of our work.
Vconect has scheduled bigger trials for summer 2014. More SAPO Campus users will be invited to use Vconect for an extended period of time, i.e. for several weeks. Hundreds of simultaneous users will generate significant load on the server infrastructure, providing us with data on the efficiency of our multimedia transport and processing and allowing us to evaluate the Vclient performance on different hardware setups. The crucial task will be to evaluate our ability to adapt to the different and changing network conditions of each user, whilst providing decent quality of service and experience.
Despite the huge progress made in establishing videoconferencing over the Internet in recent years, there are still challenges to be addressed. Existing solutions like Google+ Hangouts are closely bound to their own web application, while open solutions like WebRTC do not inherently support group conferencing.
Two challenges have to be addressed before future solutions become widely available on the web. Firstly, the architecture has to support scalability, i.e. the ability to serve thousands of users in a cost-effective way. Secondly, the API has to be simple and easy to use for web developers.
Considering scalability, the central media router architecture seems to be most promising because it makes pragmatic use of the available resources in the client CPU and downlink channels, while allowing low complexity server components as needed for cost effective operation for many users. Hence, we predict that this architecture will be predominant for web conferencing in the future.
Considering easy integration, we regard the low-level API of WebRTC as too complex for the normal web developer. This is because it is focused on point-to-point communication and needs significant extra logic for managing and controlling group conferencing. Topics like automated floor control (“Orchestration”) and video composition (“View-Modes” selection) have to be included in a middleware layer between the web application and basic media transport. Such videoconferencing middleware is already being developed as extensions to WebRTC, see tokbox, vline, and licode.
The Vconect project has developed an alternative videoconferencing platform, which addresses those two challenges. It is also based on the central router architecture and offers an easy API for web developers. Easy integration has been verified by the smooth integration into SAPO Campus, and user trials have shown a good user experience. The Vconect platform has proven particularly useful in studying user experience and preferences as we have full control over Orchestration and video composition. In summer 2014 Vconect will have a final trial in which hundreds of users will use SAPO Campus with Vconect functionality for several weeks. The evaluation of the trial data will hopefully show that the integration of videoconferencing with social networks provides added value to the users and that the Vconect platform can provide the required scalability to offer such a service in a cost-effective way.
We conclude this paper with a self-critical look into the future of multi-party videoconferencing for the web and the role of Vconect in this future. Because of the huge interest in and industry support for WebRTC, it would be naïve to ignore this technology trend and rely solely on the Vconect development, which is a comparatively small effort with a limited lifetime. In fact, the WebRTC middleware extensions towards multi-party conferencing which are currently being developed (tokbox, vline, licode, …) make it a very attractive solution for the content layer (i.e. media processing and transmission) and can replace similar functionality provided by the Vconect browser plugin, see Fig. 7. The Vconect project did not use those middleware extensions because they were not available at the start of the project (2011) and because the browser plugin allows the highest level of flexibility in implementing and studying intelligent composition and floor control, i.e. Orchestration. With a fresh look and current knowledge of WebRTC developments, the choice may now lean towards a WebRTC-based approach, especially when considering a commercial launch of a social network instead of an experimental platform for a research project.
However, Vconect can still add value in the reasoning layer with Orchestration, which has been the focus of work and main area of innovation. Therefore, as illustrated in Fig. 7, the Vconect API as defined in vConect.js may still be used as a higher-level API for web developers. Even though the content layer may be exchanged, the Orchestration logic and simple API can be reused.