On July 1, to mark Canada Day, three Canadian choirs participated in a master class given by the First Vice-President of the World Choral Federation (Maria Guinand of Venezuela). The choirs were located in Alberta, Nova Scotia, and Toronto; I was among the singers. Our instructor, together with an audience of 100, was in St. John's. There were two live cameras in each location, making eight cameras in all. The network was of course CANARIE.
The experience reminded me of the old saw about an elephant that joined the corps de ballet. "What was remarkable was not that the elephant danced well, but that it danced at all."
The master class made the most stringent demands on the technology: to be effective, sound reproduction had to be CD-quality or better, and latency had to be low enough to be imperceptible, not to mention the complexity of the network hookup itself.
During the session, one site was dropped, but it was restored within five minutes. There were other momentary lapses. But on the whole, the network performed well.
The video conferencing was another story. Two-way latency was a second or more. Motion was jerky, with frequent freezes of half a second to a couple of seconds. With that delay, it was impossible for the instructor to conduct a choir's singing. And audio was so inferior that our instructor exclaimed, "Well, obviously we're not going to pay any attention today to how you sound!"
The plan was for the three choirs and the St. John's audience to end the broadcast by singing "O Canada" together. But with a second of latency multiplied several times over, this broke down completely, with the four locations coming in on each phrase over a span of several seconds. Singing dissolved into laughter—we had just celebrated Canada Day by demonstrating the regional disharmony that bedevils the Canadian body politic.
Our instructor congratulated the technology experts on demonstrating a technique that shows great promise for bringing together peoples of the world, “even if there are still a few wrinkles to be ironed out.” But what are the wrinkles? What will it take to produce real-time videoconferencing of a quality that could make such a session a real success?
John Riddell, Editor, Telemanagement
‘More Clear-Path Bandwidth’ Is Needed
Most of today's commercial video conference systems are designed for simple "talking head" applications with limited multi-point interaction. Using video conferencing for coordinated music, such as multiple choirs or virtual orchestras, pushes the envelope of today's video conference technology. However, there is considerable promising research being carried out in this area, for example at McGill University, working in partnership with CA*net 4 (Canada’s national optical research network).
The central problem faced by current systems is the compression of the video signal, and the synchronization of that signal to the audio, performed by off-the-shelf video conferencing systems. That's where the delays of one second and more come from.
The key to the solution is to fully utilize the network bandwidth and send uncompressed audio and video over the link. Then you only have to deal with delay due to the speed of light and speed of switching through the network.
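As a rough illustration of the speed-of-light floor mentioned above, the sketch below estimates one-way propagation delay over optical fiber. It assumes light travels through fiber at about 200,000 km/s (roughly two-thirds of its speed in a vacuum). The route lengths are hypothetical round numbers, not measurements of any CA*net 4 path, and switching delay is ignored.

```python
# Assumption: light in optical fiber covers roughly 200,000 km per second.
FIBER_KM_PER_S = 200_000.0


def one_way_delay_ms(route_km: float) -> float:
    """Propagation delay in milliseconds over a fiber route of route_km."""
    return route_km / FIBER_KM_PER_S * 1000.0


# Hypothetical route lengths, chosen only to show the scale of the numbers.
for route_km in (500, 2_000, 5_000):
    print(f"{route_km:>5} km: {one_way_delay_ms(route_km):.1f} ms one way")
```

Under these assumptions even a 5,000 km route adds only about 25 ms each way, far below the second or more introduced by compression and synchronization.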
Jeremy Cooperstock and John Roston at McGill University have developed special “Ultra Video” technology to minimize the problem (http://ultravideo.mcgill.edu). They are world leaders in this technology.
Together with McGill’s Wieslaw Woszczyk, they have demonstrated that musicians can play with a slight delay as long as that delay is constant and under about 200 milliseconds. Used the right way, that is, with dedicated lightpath capability, CA*net 4 can help achieve those characteristics across North America. The McGill researchers have been successful with small groups of jazz musicians between Montreal and California but can do much better over shorter distances.
The McGill team continues to innovate. They are hoping to show multiple streams of bi-directional high-definition TV between Montreal and Seattle at the November SC2005 conference on high-performance computing. That demonstration will have a music teacher in Seattle giving a master class to a jazz ensemble located in Montreal.
Getting a choir coordinated in a large performance space requires a leader due to the slow speed of sound through air. The problem is compounded when you try to do this with multiple choirs separated by hundreds or thousands of kilometers, but it has to be solved in a similar way. There has to be a central coordination point (the conductor) and then a separate mix of the sounds from each site which is fed back to the ears of each member of the choirs. This isn't your usual video conferencing setup!
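To put the slow speed of sound in numbers, here is a minimal sketch that converts a distance across a hall into acoustic delay. It assumes sound travels at about 343 m/s (dry air at roughly 20 degrees C). The distances are hypothetical and do not describe any particular venue.

```python
# Assumption: speed of sound in dry air at about 20 degrees C.
SPEED_OF_SOUND_M_PER_S = 343.0


def acoustic_delay_ms(distance_m: float) -> float:
    """Time in milliseconds for sound to cross distance_m of air."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


# Hypothetical distances between singers in a large performance space.
for distance_m in (10, 30, 60):
    print(f"{distance_m:>3} m of air: {acoustic_delay_ms(distance_m):.0f} ms")
```

On these figures, sound crossing 30 m of air is delayed by a little under 90 ms, which is why singers spread across a large space follow a conductor rather than their ears.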
One of the biggest problems for multi-point video conferences is that they typically pass through a single point for redistribution. That box is a Multipoint Control Unit, or MCU. If that machine is poorly connected or CPU- or backplane-challenged, then you'll see a bad video conference even if the end-points have good cameras, excellent encoders and great network connectivity. So a modern MCU is a good first step.
It is really hard to handle echo problems (echo cancellation) in any general way. There are good cheap solutions for voice in a small room, and good expensive solutions for voice in a big room, but for music, especially in a large space, it is very difficult.
Variability is the big villain in video conferences. If you are sharing the line with other IP applications at any point in your network, you could run into problems where the packets from the video conference are delayed or even lost. This will cause bad video, pops in the audio and even dropped connections. So you probably don't want a Grand Challenge physics project sharing your network while you sing "O Canada."
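The effect of variability can be sketched with a fixed playout deadline: a receiver that plays each packet a set time after it was sent must discard anything arriving later, and each discarded packet is a potential pop or freeze. The delay values, the 60 ms deadline and the function name below are all illustrative assumptions, not figures from any real conference.

```python
def late_packets(arrival_delays_ms, playout_delay_ms):
    """Return the delays of packets that miss a fixed playout deadline."""
    return [d for d in arrival_delays_ms if d > playout_delay_ms]


# Hypothetical one-way delays (ms) for ten consecutive packets.
quiet_link = [40, 41, 40, 42, 41, 40, 41, 42, 40, 41]
shared_link = [40, 41, 95, 42, 41, 130, 41, 42, 40, 88]

# Assumed receiver setting: play each packet 60 ms after it was sent.
PLAYOUT_DELAY_MS = 60

for name, delays in (("quiet", quiet_link), ("shared", shared_link)):
    missed = late_packets(delays, PLAYOUT_DELAY_MS)
    spread = max(delays) - min(delays)
    print(f"{name}: spread {spread} ms, {len(missed)} of {len(delays)} packets late")
```

Most packets on the shared link arrive as quickly as on the quiet one. The damage comes from the few stragglers, so the receiver must either drop them or enlarge its buffer and add delay for everyone.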
Three Lines of Research
There are currently three key areas where IP-based video conferencing development is taking place:
1. The Ultra Video non-compressed video and surround sound work at McGill, currently being upgraded to high-definition video. Other groups around the world—e.g. in the Netherlands, Japan, Korea, Australia and the USA—are working on variations on this theme. The idea is to see how close to being there you can get if you remove constraints on bandwidth and keep the latency (delay) to a minimum. Typically 1.5 gigabits per second is necessary for each stream; McGill is currently working with 3 bi-directional streams. Needless to say these experiments are still very expensive.
2. DV (Digital Video) encoders/decoders of the type used for consumer camcorders can be used as a cheap entry into fixed-rate compressed video with stereo audio. Because of compression, there's often an unavoidable delay when using these, but it is constant and predictable, which is better than some of the commercial video conferencing systems. Also, the audio quality is significantly better, especially for music, since most commercial VC systems are optimized for voice (which really is no surprise). Typical streams are about 30 megabits per second. Internet2 in the U.S. has an active user group using technology originally from Japan.
3. Access Grid (AG) is yet a third approach to VC that is still developing. The idea is that each system has a wall-sized display (approximately 3072x768 pixels), usually produced by at least three projectors. Each site generates at least three views of its room for transmission to other sites. A high-speed, multicast-enabled network connects all sites. Each site has an operator to optimize the placement of these video streams on the screen. Current work aims to integrate better quality video and audio systems into the Access Grid framework, as well as producing tools to more easily manage the system. Australian researchers seem to be leading in this although the technology base was originally developed in the U.S.
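The per-stream figures quoted in items 1 and 2 above can be turned into aggregate bandwidth with simple arithmetic. The sketch below assumes that a bi-directional stream carries the full per-stream rate in each direction. The function and constant names are illustrative only.

```python
# Figures quoted in the text, in megabits per second.
UNCOMPRESSED_STREAM_MBPS = 1_500  # about 1.5 gigabits per second per stream
DV_STREAM_MBPS = 30               # about 30 megabits per second per stream
STREAMS = 3                       # McGill's three bi-directional streams


def aggregate_mbps(per_stream_mbps, streams, bidirectional=True):
    """Total bandwidth, counting both directions when bidirectional."""
    directions = 2 if bidirectional else 1
    return per_stream_mbps * streams * directions


uncompressed_total = aggregate_mbps(UNCOMPRESSED_STREAM_MBPS, STREAMS)
dv_total = aggregate_mbps(DV_STREAM_MBPS, STREAMS)

print(f"Uncompressed: {uncompressed_total / 1000:.1f} Gbit/s in total")
print(f"DV:           {dv_total} Mbit/s in total")
print(f"Ratio:        {uncompressed_total // dv_total}x")
```

On that assumption, three uncompressed bi-directional streams need about 9 gigabits per second in total, some fifty times what the same number of DV streams would use. That gap goes a long way toward explaining the cost difference between the two approaches.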
The whole VC landscape is complicated by the fact that most current IP-based systems are based on H.323, which in Internet terms is clunky, cumbersome and hard to deal with. SIP-based systems are appearing at a fast rate due to the simplicity and openness of the protocol.
None of the three technologies described here uses SIP yet. But as they move toward a common call setup, it is likely that SIP will be adopted.
In the long term, business communications is likely to borrow from each of the above approaches. It will have the quality of the McGill HD/Surround Sound streams, the low cost of the DV camera-based systems, and the management structures and meeting place ideas from the Access Grid work. And certainly less compression will be involved and more clear-path bandwidth required.
—Peter Marshall, Director of Network Applications, Canarie