Marian F. Ursu, Goldsmiths, University of London
Peter Stollenmayer, Eurescom GmbH
Doug Williams, BT
Pedro Torres, Portugal Telecom (SAPO)
Pablo Cesar, CWI
Niko Farber, Fraunhofer IIS
Erik Geelhoed, Falmouth University
This article introduces the Vconect project. Vconect (Video Communications for Networked Communities) is a collaborative European research and development project dealing with high-quality enriched video as a medium for mass communication within social communities. Vconect innovates in the following technical capabilities: high-quality a/v capture, dynamic a/v composition, network resource optimization and communication orchestration. The project is driven by two main use cases. The first focuses on the integration of live video communication with social networking services. The second focuses on distributed performances, their automatic representation to remote spectators and the support for social interaction around such performances.
Video communication is growing in popularity, with major manufacturers, such as Microsoft/Skype and Google, targeting the consumer end, group communication, and more comfortable modes of use. This is not at all surprising, as people naturally need to see and hear each other when they are in conversation. Body language and voice intonations are an essential part of the communication, in many instances more important than the words themselves. Furthermore, being able to show each other the objects and events about which we are talking, through video snapshots and audio recordings, significantly improves the quality of the conversation.
Social networks have taken the world by storm, having already become an intrinsic part of our social fabric. For example, in April 2014, Facebook reported 900 million unique monthly visitors and Google+ 120 million (http://www.ebizmba.com/articles/social-networking-websites). This is not at all surprising, as people are social beings: we need to form communities and interact with each other. Community interaction and belonging are an intrinsic part of our existence and well-being. We have to share thoughts with each other, help, teach, play, or simply engage in idle chat.
Until recently, these two natural and complementary forms of mediated social interaction were more or less separated from each other. Advances have since been made in integrating them, for example by Facebook, which has integrated Skype and now offers point-to-point live video communication, and by Google+, which offers integrated group video communication via Hangouts. Still, more remains to be done for their full integration. As a matter of fact, “integration” is probably not the right term, as new forms of mediated social interaction could emerge by taking the two paradigms as starting points.
This paper presents Vconect, a project that focuses on the development of new forms of social interaction through live video, while also considering their integration with social networks.
Taking the perspective of live video communication only, Vconect considers two tightly-linked great challenges:
The former challenge refers to the ability of the overlay network to adapt dynamically, in aspects such as the choice of active cameras and the modes of compositing and mixing the video content on each screen, in response to factors such as the number of simultaneous users, the roles they have in the communication, the ability to separate into subgroups and have parallel conversations, and the ability to deal with larger groups interacting with each other from different locations (see Figure 1 for illustration).
The latter challenge refers to the ability of the overlay network to dynamically adapt in aspects such as encoding parameters (bit rates, resolution, formats, etc.) and transmission routes (to minimize delay), and to perform partial composition in the network, both with a view to optimizing the communication experience together with the cost of the network operation.
There are various issues generated by each of these two challenges. However, they ought to be addressed concomitantly, if robust and effective technical solutions are to be developed.
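The route-selection aspect of the second challenge can be illustrated with a minimal sketch. It is not Vconect's network optimisation, which is not described at this level of detail here; it simply applies Dijkstra's shortest-path algorithm to a hypothetical overlay in which every link is annotated with a one-way delay. The node names, the delay figures and the function `min_delay_route` are all assumptions made for illustration.

```python
import heapq

def min_delay_route(links, source, target):
    """Return (total_delay_ms, path) for the lowest-delay route.

    links maps a node name to a dict {neighbour: one-way delay in ms}.
    Returns (inf, []) when the target cannot be reached.
    """
    best = {source: 0.0}      # lowest delay found so far for each node
    previous = {}             # node -> predecessor on its best route
    queue = [(0.0, source)]
    while queue:
        delay, node = heapq.heappop(queue)
        if node == target:
            break
        if delay > best.get(node, float("inf")):
            continue          # stale queue entry, a better route is known
        for neighbour, link_delay in links.get(node, {}).items():
            candidate = delay + link_delay
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))
    if target not in best:
        return float("inf"), []
    # Walk back from the target to rebuild the route.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[target], path

# Hypothetical overlay: two clients and two relay nodes (delays in ms).
OVERLAY = {
    "client_a": {"relay_1": 20, "relay_2": 35},
    "relay_1": {"relay_2": 10, "client_b": 60},
    "relay_2": {"client_b": 15},
}

print(min_delay_route(OVERLAY, "client_a", "client_b"))
```

A real service-aware network would presumably weigh delay against encoding cost and the option of composing streams inside the network; the sketch covers delay only.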
The Vconect project (http://www.vconect-project.eu) is an example of such an endeavor, and the remainder of this paper presents the challenges as considered, and the solutions as devised, by Vconect.
2 THE VCONECT PROJECT
The Vconect vision is the adoption of high-quality enriched video as a medium for mass communication within communities.
Vconect is building a video communication platform which models and supports the complex communication topologies that characterize conversations between group members. The system takes intelligent decisions to mediate the communication at the level of audio-visual choices, screen layout and network capabilities. Vconect is ensuring the wide applicability of the platform by implementing, testing and evaluating it in the context of two different use cases. The first use case is based on the integration of video experience into social networking services, the second on group mediated performance.
Vconect’s technological challenge is to develop components which enable a service-aware network. They must work together to intelligently and dynamically optimise network and media processing resources to satisfy the changing requirements of group conversations in communities. The requirements for high quality audio and video and low latency, which are inherent in a high quality experience, make this challenge even more demanding.
Vconect is advancing the state of the art in the following areas:
3 THE PERFORMANCE USE CASE: USING SMART VIDEO COMMUNICATION TO SUPPORT INNOVATIVE MULTI-SITE THEATRICAL PERFORMANCES
Despite the increasing fidelity of recorded performance, be it of musicians, dancers or actors, live performances retain an enduring appeal. At the same time, the nature of performance continually evolves, responding to the affordances provided by technology. Two trends are particularly relevant. The first is the live streaming of performances from theatres to cinemas, so that stunning performances can be enjoyed by people unable to travel to, or be accommodated within, the venue housing the performance. Live streaming of theatrical performances was pioneered by The Metropolitan Opera in New York and has been embraced by, amongst others, the UK’s National Theatre (NT Live, http://ntlive.nationaltheatre.org.uk/) and the Royal Opera House (www.roh.org.uk). The second involves the audience moving between a number of performance spaces to find and appreciate different elements of the performance, which they subsequently have to assemble into their own stories from which they can derive meaning. Punchdrunk have “pioneered a game changing form of theatre in which roaming audiences experience epic storytelling inside sensory theatrical worlds” (www.punchdrunk.com).
Vconect seeks to work at the intersection of these two trends. Working with the Cornwall-based Miracle Theatre Company, we are deploying our technology in a performance of Shakespeare’s The Tempest, adapted such that the story is told through performances that take place in two separate venues, each with its local audience. Smart video communications technology from Vconect will provide a means for the performers in the two locations to communicate with each other; for the actors in each location to be aware of both the local audience and the audience at the remote venue; for the audiences in both locations to be aware of the performances in both the local and the remote venue; and for a home-based audience to enjoy a streamed version of the performance synthesised from the audio and video captured in the two performance spaces.
Vconect technology is being deployed to support this use case, which involves multiple cameras and multiple screens at each performance space. In addition, new technology is being developed that will enable smart video communications to operate effectively in this complex scripted context. Tools are being developed to enable the writer to provide directorial instructions that can be translated into a machine-readable instruction set that will control which audio and video signals are selected for transmission, how they are transmitted, and how they are displayed at each viewing location.
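To make the idea of a machine-readable instruction set more concrete, the following is a minimal sketch of what such instructions might look like. The format of the actual Vconect tools is not described in this article, so the `Cue` record, its fields and the feed and screen names below are hypothetical. Each cue says which feed is selected, how it is transported and on which screen it is shown, from a given line of the script onwards.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cue:
    """One hypothetical directorial instruction."""
    start_line: int   # script line at which the cue takes effect
    source: str       # camera feed selected for transmission
    transport: str    # how the feed is sent, e.g. "low_latency"
    screen: str       # screen on which the feed is displayed

def active_cues(cues, script_line):
    """Return, for each screen, the latest cue at or before script_line."""
    current = {}
    for cue in sorted(cues, key=lambda c: c.start_line):
        if cue.start_line <= script_line:
            current[cue.screen] = cue   # later cues override earlier ones
    return current

# Hypothetical cue sheet for two venues.
SCRIPT = [
    Cue(1, "venue_a/cam_wide", "high_quality", "venue_b/main_screen"),
    Cue(1, "venue_b/cam_wide", "high_quality", "venue_a/main_screen"),
    Cue(40, "venue_a/cam_closeup", "low_latency", "venue_b/main_screen"),
]

for screen, cue in active_cues(SCRIPT, 45).items():
    print(screen, "shows", cue.source, "via", cue.transport)
```

Keying cues on script lines rather than on clock time is one possible design choice for scripted theatre, where the pace of a live performance varies from night to night.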
This deployment will allow Miracle Theatre Company to explore the challenges and opportunities associated with such a multi-site performance and will help them to pursue their goal of defining new genres of performance that mix theatre and film. At the same time it will help Vconect to develop tools that are well suited to the workflow required for theatrical performance.
The performances are scheduled for September 2014 and will take place in Cornwall in the UK.
4 THE SOCIALISATION USE CASE: INTEGRATING SMART VIDEO COMMUNICATION INTO SOCIAL NETWORKING SERVICES
The general use case targets the integration of real-time video communication with more asynchronous communication on social networks. It is formulated to understand how real-time video communication could enhance social networking forms of group communication, and to explore the dynamics of their relationship: for example, the way people migrate from one form to the other, how they (re)use resources between platforms, and what could be extracted from these interactions in order to improve the overall quality of the communication experience. Inherently, this case will explore more complex topologies of real-time video communication than those supported by existing state-of-the-art systems, for example considering more communication nodes or subgroups, the latter referring to the ability to have a conversation within a smaller group whilst still having a presence in the larger social group.
This use case will be implemented through “SAPO Campus”, SAPO being a brand of Portugal Telecom and Campus its platform for social-media-based learning, which targets schools and universities and can be used by both teachers and students. On top of the usual social networking features, like activity feeds of status updates, comments, forming groups and sending private messages, it includes SAPO’s well-known services like blogs, photo, video and file sharing. Importantly, the system also enables some of the functionalities of a learning platform: the ability to set homework, to submit homework and to keep track of a student’s progress. The content and applications are fully owned and managed by the community with their own branding. SAPO Campus is different in nature from large well-known social networks, as people in the same instance of the Campus are part of the same institution, providing a strong sense of community and responsibility. Moreover, it empowers schools as content providers, giving them a public face and a single hub for their content.
A Vconect enabled video communication capability will be integrated with SAPO Campus, allowing seamless transfer between real-time video communication and the other currently supported forms of communication.
A key challenge for live video communication disclosed by this use case is the ability to support ad-hoc groups. Different groups, at different times, may initiate real-time video communication sessions between their members. To simplify the description, let us follow one of these groups only. Its members are video-connected, but at the same time they may have active links with other users on SAPO Campus. The group may also be visible as being engaged in a video communication. Other users, if allowed, may join the conversation. Existing members may leave. The main topic of conversation may branch off into a number of subthemes. These may lead to the formation of subgroups, each subgroup being intensively engaged in conversations around its topic, but having the ability to “have a presence” in and maybe even “keep an eye on” what is going on in the main group. As users can join and leave as they wish, this process leads to dynamic clusters that “travel” across the community, like swarms. Facilitating the communication needs in such dynamic structures is a challenge to address – this will be done under the heading of orchestration. Developing overlay media network optimisation techniques and the associated configurable media-rich processes, necessarily required by such configurations, is another challenge – this will be done under the headings of service-aware network and configurable a/v processes.
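The group dynamics described above can be sketched as a small data model. This is an illustration only, under the assumption that a user belongs to at most one subgroup at a time; the `Session` class, its methods and the user names are hypothetical and are not part of the Vconect or SAPO Campus APIs. For each user, `streams_for` separates the peers they are actively conversing with from those in whose conversation they merely keep a presence.

```python
class Session:
    """A hypothetical ad-hoc video session with optional subgroups."""

    def __init__(self, name):
        self.name = name
        self.members = set()
        self.subgroups = {}   # topic -> set of members discussing it

    def join(self, user):
        self.members.add(user)

    def _remove_from_subgroups(self, users):
        for group in self.subgroups.values():
            group.difference_update(users)
        # Subgroups that became empty disappear.
        self.subgroups = {t: g for t, g in self.subgroups.items() if g}

    def leave(self, user):
        self.members.discard(user)
        self._remove_from_subgroups({user})

    def branch(self, topic, users):
        """Move the given members into the subgroup for a subtheme."""
        users = set(users) & self.members
        self._remove_from_subgroups(users)
        if users:
            self.subgroups.setdefault(topic, set()).update(users)

    def streams_for(self, user):
        """Return (focus, presence): peers in the user's own circle,
        and the remaining members the user only keeps an eye on."""
        own = next((g for g in self.subgroups.values() if user in g), None)
        in_subgroups = set().union(*self.subgroups.values())
        circle = own if own is not None else self.members - in_subgroups
        return circle - {user}, self.members - circle

session = Session("class_10b")
for user in ["ana", "bruno", "carla", "duarte"]:
    session.join(user)
session.branch("homework", ["carla", "duarte"])
print(session.streams_for("carla"))
```

An orchestration component could use the two sets to decide, for instance, which streams deserve full quality and which can be shown as small, low-bit-rate thumbnails; that mapping is left open here.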
Actively connected members may decide to illustrate their points with sample media, which they access through the existing SAPO Campus interfaces. Combining the real-time video conferencing feature between groups with the ability to share time-based media is another challenge generated by this use case.
Finally, metadata extracted from the integrated communication platform (live video and SAPO Campus) can inform its decision-making process, such as the way recorded media is suggested or the way the real-time video communication is orchestrated.
Through this summer (2014), Vconect is conducting a trial aiming at evaluating the integrated communication platform and the users’ communication experiences it can support.
Vconect aims to provide an open web interface, similar to WebRTC, that offers group video conferencing capabilities similar to those of Google Hangouts. The first successful implementation based on the Vconect API has been completed and the open API has been made available at Codebits 2014 (https://codebits.eu/).
5 EXPERIMENTS TO UNDERSTAND THE MAIN REQUIREMENTS OF INTEGRATION OF VIDEO EXPERIENCE INTO NETWORKING SERVICES
Complementing the research driven by the two use cases described above, Vconect also explores the development of main technical capabilities through experimental enquiry. Three are summarised below.
A View-modes and orchestration
A desktop screen could be organised such as to accommodate different video windows of different sizes in order to make the group conversation as easy and fluent as possible. Different such views will suit different people depending on the context. In some situations it may be good for all participants to see each other all the time; in other cases it may be best to see just one other person in full screen mode. As conversation roles change, it is very likely that a particular window should be used to show more than one group member, this leading to the requirement of mixing streams from different locations. Not only this, but the actual layout organisation may have to change in time, as the communication contexts change. We refer to the layout organisation of the live-video windows on the screen as view-mode. Orchestration is the process that decides how content from different sources should be mixed in each particular window, depending on the conversation context, as well as choosing a particular view-mode. An experiment was carried out that compared communication experiences in three main view-modes (see figure 5 below):
A total of 54 volunteers took part in this experiment, of whom 18 were female (mean age 18.24, SD = 4.07) and 36 male (mean age 20.31, SD = 7.54). 16 were aged between 14 and 16 and 31 were aged between 17 and 30. There was a high predominance of participants under 20 years of age, as they were recruited from local secondary schools, colleges and universities.
Initial conclusions based on cluster analysis suggest that the Mosaic view mode is better suited to supporting fast turn taking, providing a sense of group cohesion whilst at the same time supporting the conveyance of the individuals’ presence to each other (group presence). The Full Screen view mode appeared to be better suited to supporting slower-paced communication instances, of a more intimate nature, in which there is mostly one person talking and the audience can see the facial expressions of the speaker in great detail. The Unbalanced Mosaic view mode is an interesting compromise between the other two more extreme cases, providing both for faster-paced conversations and group presence, as well as slower-paced conversations of a more intimate nature. Inevitably, though, it also fails to provide the best results in either of the two extreme cases. A more detailed description of the experiment and its findings has been submitted for publication. Future experiments will also explore the effect of dynamically changing the view-modes through orchestration.
The three view-modes experimented with are also currently supported by Skype, and the latter two by Hangouts. Nevertheless, on the one hand, there is an open space of solutions when it comes to implementing the orchestrated behaviour of the system within these modes – i.e. how content is mixed within each window and how the view-modes change in real-time. On the other hand, there is also an open space for solutions regarding the way in which the network transmission is being optimised. In fact, it is the collaboration between these two reasoning processes that raises many research questions and provides space for innovation.
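As an illustration of how the initial conclusions on view-modes might feed an orchestration rule, the sketch below picks a view-mode from the pace of turn taking. It is a deliberately naive rule, not the Vconect orchestrator: the thresholds `fast` and `slow` and the log format are assumptions, and only the three view-mode names come from the experiment.

```python
from enum import Enum

class ViewMode(Enum):
    MOSAIC = "Mosaic"
    UNBALANCED_MOSAIC = "Unbalanced Mosaic"
    FULL_SCREEN = "Full Screen"

def turns_per_minute(speaker_log, window_s):
    """Count speaker changes in a log of (time_s, speaker) entries
    covering window_s seconds, scaled to changes per minute."""
    changes = sum(
        1
        for (_, earlier), (_, later) in zip(speaker_log, speaker_log[1:])
        if earlier != later
    )
    return changes * 60.0 / window_s

def choose_view_mode(rate, fast=10.0, slow=2.0):
    """Map a turn-taking rate to a view-mode (thresholds are assumed)."""
    if rate >= fast:
        return ViewMode.MOSAIC          # fast turn taking, group presence
    if rate <= slow:
        return ViewMode.FULL_SCREEN     # mostly one person talking
    return ViewMode.UNBALANCED_MOSAIC   # compromise between the two

log = [(0, "a"), (5, "b"), (9, "a"), (14, "c"), (20, "b"), (26, "a")]
print(choose_view_mode(turns_per_minute(log, window_s=30)).value)
```

A practical rule would also need some damping so that the layout does not flip whenever the rate hovers around a threshold.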
B Social communication in living room setups
Video communication is not just about the desktop and Vconect is aiming to explore other communication setups such as, for example, sit-back communication via TV screens and multiple cameras that could cover living room spaces.
Vconect carried out an experiment based on a communication between three typical living room setups, each equipped with multiple cameras and a large TV set. Participants (6 per session) were invited to prioritise the qualities of their dream holiday and home in an informal context. They experienced two conditions:
A total of 24 volunteers took part in the experiment, of which 18 were female (mean age 22.89, SD 3.86) and 6 were male (mean age 22.83, SD 3.25) and they were all students at Goldsmiths, University of London.
The participants reported some very interesting effects of the difference between the orchestrated and static conditions. Many enjoyed the feeling of intimacy that emerged from seeing the detail of the close up shots but, at the same time, felt that a different segmentation of the communication space had occurred. For example, they considered that an intimate conversation was possible through the orchestrated condition, but not through the static split-screen condition. However, when the conversation was animated the split-screen condition was preferred as sometimes, when the rhythm of the communication was too fast, automatic mixing – that is orchestration – could not keep up with it. The more detailed description of the experiment and the corresponding results are currently being prepared for publication.
These insights, combined with those gained from the view-mode experiment, suggest that orchestration needs to balance the need to see the active speakers with the need to provide for group presence, while also catering for other, more subtle, aspects of the conversation. At the same time, orchestration should dynamically control the screen layouts (e.g. mix split screens with full screen).
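One way to read the participants' feedback is that close-ups work while the conversation is calm, and that the system should retreat to the split screen when speaker changes come too quickly for cutting to keep up. The sketch below encodes just that reading. The function `plan_shots`, the shot labels and the two-second threshold are assumptions for illustration and do not describe the orchestrator used in the experiment.

```python
def plan_shots(speaker_events, min_gap_s=2.0):
    """Turn a list of (time_s, speaker) change events into shot decisions.

    A close-up of the new speaker is chosen unless the previous change
    happened less than min_gap_s seconds ago, in which case the layout
    falls back to the static split screen. Decisions use past events only.
    """
    shots = []
    last_change = None
    for time_s, speaker in speaker_events:
        rapid = last_change is not None and (time_s - last_change) < min_gap_s
        shot = "split_screen" if rapid else "closeup:" + speaker
        if not shots or shots[-1][1] != shot:
            shots.append((time_s, shot))   # record only actual cuts
        last_change = time_s
    return shots

events = [(0, "ana"), (8, "rui"), (9, "ana"), (9.5, "eva"), (15, "rui")]
for time_s, shot in plan_shots(events):
    print(time_s, shot)
```

With the events above, the plan cuts to close-ups at 0 s and 8 s, falls back to the split screen during the rapid exchange starting at 9 s, and returns to a close-up at 15 s.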
C Virtual microphone
The aim of this work is to capture the signal of a distant speaker using discrete arrays of static microphones. This is achieved through complex signal processing.
The functional design of a signal processing system capable of creating a “virtual” microphone from the signals of an array of microphones has been built and tested in lab conditions. A recording of a man and a woman speaking in the same room (not to each other, just speaking at the same time) was played through loudspeakers. Algorithms to generate virtual microphones have been evaluated using a variety of internal parameter settings. The system assesses the diffuseness and direction of the sound scene in time/frequency space, using the signals of the two arrays of microphones as input. The signal processing then attempts to reconstitute an audio signal from a given location in the sound scene.
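The time/frequency analysis of diffuseness and direction described above is beyond a short example. As a much simpler stand-in, the sketch below shows a classical time-domain delay-and-sum beamformer, which also tries to recover the sound arriving from a chosen location by aligning and averaging the microphone signals. It is explicitly not the algorithm evaluated in Vconect; the function name, the 2-D geometry and the rounding of delays to whole samples are assumptions made to keep the example short.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature

def delay_and_sum(signals, mic_positions, target, sample_rate):
    """Average the microphone signals after aligning them on a target.

    signals: list of equal-length sample lists, one per microphone.
    mic_positions, target: (x, y) coordinates in metres.
    """
    distances = [math.dist(position, target) for position in mic_positions]
    nearest = min(distances)
    # Extra travel time to each microphone, relative to the nearest one,
    # rounded to a whole number of samples.
    lags = [
        round((distance - nearest) / SPEED_OF_SOUND * sample_rate)
        for distance in distances
    ]
    length = len(signals[0])
    output = []
    for n in range(length):
        total = 0.0
        for signal, lag in zip(signals, lags):
            index = n + lag        # read later-arriving channels ahead
            if index < length:
                total += signal[index]
        output.append(total / len(signals))
    return output

# Toy scene: at 343 Hz one sample corresponds to one metre of travel.
mics = [(0.0, 0.0), (3.0, 0.0)]
near_mic = [0, 1, 0, 0, 0, 0, 0, 0]   # pulse reaches the first microphone
far_mic = [0, 0, 0, 0, 1, 0, 0, 0]    # same pulse, three samples later
print(delay_and_sum([near_mic, far_mic], mics, (-2.0, 0.0), 343))
```

When the beamformer is steered at the true source position the two pulses add up coherently; steered elsewhere, they stay misaligned and the averaged peak is halved.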
Video communication for social groups is a large, essentially unexplored domain. Initial solutions for simple setups have been built, particularly by Microsoft Skype and Google+ Hangouts, but these represent only the early steps in this vast landscape. Vconect represents yet another step forward in this exploration. Vconect considers more complex communication topologies and is working on the development of a number of core technical capabilities, regarded as essential in supporting more complex communication structures, namely: encoding of high quality audio and video, dynamic audio-video composition, and automatic decision making processes able to adapt the communication infrastructure to the dynamic needs of social communication.
Vconect is a collaborative European R&D project within the European Union’s Seventh Framework Programme for Research and Technological Development. It receives funding from the European Community's Programme under grant agreement no. ICT-2011-287760.
We would like to thank all partners (in alphabetical order) – Alcatel-Lucent, BT, CWI, Eurescom, Falmouth University, Fraunhofer IIS, Goldsmiths, University of London, Joanneum Research Forschungsgesellschaft, and Portugal Telecom – for their inputs and comments.
http://vconect-project.eu), LeanBigData (http://leanbigdata.eu), on ultra-scalable big data algorithms, and AppsForEurope (http://www.appsforeurope.eu), on open data incubation and turning open data into viable businesses. Previously, he was a research fellow at Goldsmiths College, University of London, working for over two years on knowledge representation and automated reasoning in the FP7-funded TA2 project (http://www.ta2-project.eu), which delivered video-conferencing to the home, and, before that, he was a research assistant at Imperial College London in the areas of machine learning, automated reasoning and computational creativity.
Pablo Cesar leads the Distributed and Interactive Systems group at CWI (The National Research Institute for Mathematics and Computer Science in the Netherlands). He has (co)-authored over 50 articles about multimedia systems and infrastructures, social media sharing, interactive media, multimedia content modelling, and user interaction. He has given tutorials about multimedia systems in prestigious conferences such as ACM Multimedia, CHI, and the WWW
Nikolaus Färber received his Doctoral degree in 2000 as a member of the Image Communication Group, University of Erlangen-Nuremberg, Germany. He has published numerous conference and journal papers in the area of robust video transmission and has contributed successfully to international standards bodies, such as MPEG, ITU, 3GPP, and DASH-IF. After being a Post-Doc at Stanford University in 2001, he joined Ericsson Eurolab, Nuremberg, Germany, as a member of the speech processing group. Since 2003 he has been with Fraunhofer IIS, Erlangen, Germany, where he heads the Multimedia Applications department.