Portal:Complex Systems Digital Campus/E-Laboratory on computational ecosystems as complex systems

An e-laboratory on Computational Ecosystems

Name, e-mail, website and institution
Responsible:

Matthieu Herrmann, Klaus Jaffe, Guiou Kobayashi, Michael Kohlhase, Krzysztof Krawiec, Mathieu Leclaire, Emmanuelle Leize, Jorge Louçã, Miguel Luengo, Evelyne Lutton, Nicolas Marilleau, Fatima Oulebsir Boumghar, Jérôme Pansanel, Pierre Parrend, Noelle Potier, Francisco Prieto Castrillo, Romain Reuillon, Camille Roth, Thierry Savy, Michèle Sebag, Juan Simoës, Eddie Soulier, Djibi Sow, Patrick Taillandier, Carla Taramasco, Sylvie Thiria, Julie Thompson, Souidi Zahira, Saikou FALL

Coordination committee

 * Pierre Collet
 * Paul Bourgine

e-laboratory Scientific Committee
(to be completed)

Objectives
The objective of the CS-DC UNESCO UniTwin is to create a North-South-South and East-West network of scientists from all disciplines and institutions for embodying socially intelligent strategies in research and education on complex systems science.

Within the new science of Complex Systems, the objective of the CECS e-lab (Computational Ecosystems as Complex Systems) is to study and develop computational ecosystems made of many possibly heterogeneous machines, possibly running heterogeneous pieces of software in order to solve a single problem that could be either related to complex systems (reconstruction of multi-scale dynamics, simulation of a complex system, ...) or not.

The e-laboratory will organize the creation, sharing and integration of all resources for these strategies. The necessary resources will be shared as closely as possible with the other institutions that have the same sharing goal. By default, the CS-DC sharing rule is Creative Commons Attribution-NonCommercial-ShareAlike 3.0 for publications (except if people want to reserve commercial use or modifications or impose copyleft on modifications), or a status close to public domain for data, such as the Creative Commons Public Domain Dedication, if you agree that attribution of data is too difficult to impose as a contractual obligation and believe that the scientific norm of citation is sufficient. Copyleft licences or more sophisticated licences, as described in the CS-DC By-laws, can also be used. The main functions of the CECS e-lab are thus to share and integrate all the resources, to perform all the computations and to visualize all the results in the most intuitive way.

Challenges
Complex systems science is closely linked to computing for many reasons: finding a mathematical representation to model the multiscale dynamics of a complex system typically requires a great amount of computing power, because this implies using solvers that will repeatedly check how well the data produced by the model matches the (possibly big) observed data. Then, once a mathematical model is available, computing power is again needed to produce the multiscale dynamic behaviour to be compared with the observed data, and this step must be repeated many times in order to test the resilience of the complex system and determine what can be done to prevent extreme events from happening.

An e-laboratory focussed on computing is therefore central to the science of complex systems. Because the amount of data storage and computing power required almost always vastly exceeds that of a standard workstation, a complex systems scientist will typically need the combined power of many computers in order to obtain interesting results.

Therefore, the challenge of the CECS e-lab is to develop and deploy whatever is necessary, in terms of both software and hardware, in order to build computing systems that can deal with large complex systems, bridging the gap from the individual to the collective, from genes to organisms to ecosystems, from atoms to materials to products, from notebooks to the Internet, from citizens to society.

Because some large complex systems will require the combined power of several computers, one of the more advanced challenges of the CECS e-lab will be to put together computing ecosystems consisting of heterogeneous computers possibly executing different pieces of software, all working in a cooperative way towards the single aim of modelling a particular complex system. Because a set of different autonomous interconnected machines can itself be considered as a complex system, the ultimate aim of the CECS e-lab is to create Computational Ecosystems working together as a Complex System.

Such complex systems could exhibit emergent behaviour which, in the case of a computational ecosystem, could lead to observing super-linear (or supra-linear) acceleration, provided that acceleration is measured in quality and not in quantity. Qualitative acceleration could be defined as how much faster a complex system made of n machines finds a result of the same quality as a single machine on the same problem. Computational ecosystems working as a complex system can be used to solve any kind of very large problem. Even if the problems are not related to complex systems, the fact that a complex computational ecosystem is elaborated to solve them makes this research fall within the scope of this e-laboratory.
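The definition of qualitative acceleration above can be made concrete with a small sketch. It assumes that each run is logged as a list of (elapsed time, best quality so far) pairs; the trace format, the function names and the numbers below are illustrative assumptions and do not come from the e-lab.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical trace format: (elapsed_seconds, best_quality_so_far), in time order.
Trace = Sequence[Tuple[float, float]]


def time_to_quality(trace: Trace, target: float) -> Optional[float]:
    """Return the first elapsed time at which the quality reaches the target."""
    for elapsed, quality in trace:
        if quality >= target:
            return elapsed
    return None  # the target quality was never reached


def qualitative_acceleration(single: Trace, ecosystem: Trace,
                             target: float) -> Optional[float]:
    """How much faster the ecosystem reaches the same quality as one machine."""
    t_single = time_to_quality(single, target)
    t_ecosystem = time_to_quality(ecosystem, target)
    if t_single is None or t_ecosystem is None or t_ecosystem <= 0:
        return None
    return t_single / t_ecosystem


def is_supra_linear(acceleration: float, n_machines: int) -> bool:
    """Supra-linear means the acceleration exceeds the number of machines."""
    return acceleration > n_machines


# Made-up traces: one machine versus an ecosystem of 8 machines.
single_machine = [(10.0, 0.50), (100.0, 0.80), (1000.0, 0.95)]
eight_machines = [(5.0, 0.60), (20.0, 0.85), (100.0, 0.96)]

acc = qualitative_acceleration(single_machine, eight_machines, target=0.95)
```

With these made-up traces the single machine needs 1000 s and the ecosystem 100 s to reach quality 0.95, so the acceleration is 10 on 8 machines, which would count as supra-linear under this definition.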

All in all, the challenge for the CECS e-lab is to design efficient large-scale computing ecosystems as complex systems that could also be used reflexively to solve complex systems. For this to happen, many different elements need to be put together, involving data storage, communication, parallelization, security and robustness of both data and software, results visualisation, etc. All these elements can be the object of research projects of the CECS e-lab.

Advanced (Grid and cloud) computing ecosystem
Participants: * Carlos Jaime Barrios Hernandez * Francisco Prieto Castrillo * Pierre Collet * Ismaila Diouf * Aouaouche El Maouhab * Jérôme Pansanel * Romain Reuillon

References:

Advanced Computing is a paradigm for data-intensive computation, aimed first at scientific and academic needs and second at industrial and societal needs with significant impact. Data-intensive computing in e-research communities uses distributed and high-performance computing systems, seeking not only high performance but also usability and ease of deployment. To support these activities, e-researchers can use Grid Computing platforms and, today, cloud computing systems.

In Latin America, for example, the High Performance and Scientific Computing community is supported by RedCLARA (www.redclara.net) and the European Union, which provide two region-wide platforms: the GISELA Project (www.gisela-grid.eu) and the "Servicio de Computo Avanzado para America Latina y el Caribe" (Advanced Computing Service for Latin America and the Caribbean), supported by RISC (Red Iberoamericana de Supercomputo) (www.risc-project.eu).

Both projects pool computing resources in Mexico, Costa Rica, Panama, Colombia, Ecuador, Venezuela, Peru, Chile, Brazil, Uruguay and Argentina. On the Colombian side, for example, the GridColombia Project sites and the Supercomputing and Scientific Computing Unit of the Universidad Industrial de Santander (http://sc3.uis.edu.co) provide access to advanced computing platforms such as GUANE-1, the most powerful machine in Colombia based on NVIDIA GPUs (128 NVIDIA Tesla M2050 Fermi GPUs).

Europe supports the interaction in both projects, using resources in Spain (particularly BSC-CNS and CETA-CIEMAT resources), France (LIG and I3S Laboratories) and Italy (INFN).

A GRID computing infrastructure can be thought of as a federation of heterogeneous computing and storage resources spread across different institutional and administrative domains.

GRIDS are distributed networked systems mapping a series of services on the Internet and, hence, their underlying topology can be rather complex. Furthermore, the coupling between the application workflow structure and the network topology has been shown to play a fundamental role in the infrastructure performance. In this regard, the GRID poses a technological challenge relevant to complexity science, either for improving existing infrastructures or for designing new extensions and deployments.

In a typical GRID, users interact with different components (services) through the user interface (UI). This machine is the access point to the infrastructure; the authentication and authorization processes take place at this point. Users then send jobs to the infrastructure, where they are handled by the Workload Management System (WMS). This service is hosted on a machine called the Resource Broker (RB), whose role is to determine the best Computing Element (CE) and to record the status and output of jobs (the Information System is also involved in these operations). The process used by the RB for selecting a CE is known as match-making and is based on availability and proximity criteria. The CE is a scheduler within a grid site (called a resource centre) that determines on which computing nodes (Worker Nodes, WN) jobs will finally be executed.
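The match-making step can be illustrated with a minimal sketch. The text only states that selection is based on availability and proximity; the concrete rule used here (keep the CEs with enough free slots, then prefer the closest one, breaking ties by the number of free slots) is an assumption made for illustration, and the class, field and site names are hypothetical rather than part of any real WMS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ComputingElement:
    """Hypothetical summary of what the Information System reports about a CE."""
    name: str
    free_slots: int   # availability: job slots currently free at the site
    distance: float   # proximity to the user or data, in arbitrary units


def match_make(ces: List[ComputingElement],
               slots_needed: int = 1) -> Optional[ComputingElement]:
    """Pick a CE for a job using an assumed availability-then-proximity rule."""
    # Availability criterion: discard sites that cannot host the job now.
    candidates = [ce for ce in ces if ce.free_slots >= slots_needed]
    if not candidates:
        return None  # the job would stay queued at the Resource Broker
    # Proximity criterion: closest site first, more free slots breaks ties.
    return min(candidates, key=lambda ce: (ce.distance, -ce.free_slots))


sites = [
    ComputingElement("ce-near-but-full", free_slots=0, distance=1.0),
    ComputingElement("ce-medium", free_slots=12, distance=3.0),
    ComputingElement("ce-far", free_slots=200, distance=9.0),
]
chosen = match_make(sites, slots_needed=4)
```

Here the closest site is rejected because it has no free slot, so the job goes to the next closest site that can host it. A production broker would use richer ranking expressions and job requirements.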

GRIDS can dramatically reduce the makespan (overall execution time) of scientific applications if some adjustments and wrappings are implemented. In this regard, computing GRIDS are classified as High Throughput Computing (HTC) systems, which provide large capacity over long periods, rather than High Performance Computing (HPC) systems, where large amounts of computing power are used for short periods of time.

Regarding the UNESCO UniTwin program on Complex Systems, the benefits of using the GRID in the CECS e-lab group are twofold. First, students could learn the basic mechanisms of a supercomputing infrastructure and run complexity-science applications on a dedicated testbed. Second, the GRID itself could be monitored and analysed as a complex system.

CETA (Extremadura Centre for Advanced Technologies) was established in Trujillo, Spain, in 2005. It is currently a multi-purpose research institution with an interdisciplinary team composed of computer scientists, engineers and physicists.

Over recent years, CETA-CIEMAT has acquired different sorts of computing resources for work and research in several distributed computing paradigms (e.g. Cloud computing, GPGPU (General-Purpose Graphics Processing Units), High Performance Computing (HPC), etc.). The distributed computing resources total 1566 CPU cores, representing 11905 Teraflops. The GPGPU resources comprise 39712 GPU cores, for a total of 113.33 32-bit Teraflops and 36.915 64-bit Teraflops. The HPC unit includes 64 cores and 1 TB of RAM. The storage capacity available at CETA-CIEMAT totals 694.4 Terabytes.

Regarding GRID technology, CETA-CIEMAT has actively participated in EU-funded projects (EEGI, EDGES, EELA, etc.). In particular, CETA-CIEMAT actively supports all the Virtual Organizations (VOs) of the National Grid Initiatives (NGIs), offering computing, storage and access resources which could be used for GRID tutorials and online demos within the UNITWIN network. To use these resources, CETA-CIEMAT would provide a user interface (i.e. the GRID access point) from which users could submit jobs to the tutorial infrastructure.

Cloud Computing resource sharing
Participants: *Romain Reuillon *Mathieu Leclaire *Carlos Jaime Barrios Hernandez

Grid computing has been successful in coordinating resource sharing at a global scale. However, it did little to make the shared resources easy to use. Since then, cloud computing concepts (* as a Service) have emphasised transparent access to remote resources: Infrastructure as a Service (IaaS) proposes to instantiate clusters of virtual machines on demand, Platform as a Service (PaaS) proposes to deploy applications at large scale, Software as a Service (SaaS) proposes transparent use of remote resources through an application, and Data as a Service (DaaS) proposes to focus on the data instead of their location. More generally, in cloud computing, conceptualizing x as a service means hiding the details of access to x in order to focus on x.

The best-known cloud computing services are the private IaaS, SaaS and PaaS providers. That kind of private cloud is incompatible with the objective of the CECS e-lab. For instance, using a private SaaS service for sharing content would mean losing any freedom over the way this sharing is handled, as well as over the shared data. The opposite should be enforced: the computational ecosystem should be an object of research and incremental evolution. Furthermore, the knowledge it produces should not be locked in proprietary formats, software or services.

The components of a cloud-computing infrastructure for research and education should be based solely on free software and open formats. The computing ecosystem should allow the user to export data and processing algorithms from the cloud ecosystem with no other restriction than enforcing the freedom of the content (i.e. share-alike for CC or GPL). The exported algorithms and data should be usable outside the cloud ecosystem, notably for reproduction and falsification, which are prerequisites for much scientific research.

Fortunately, some free (in the sense of freedom) components are available to build a cloud-computing infrastructure for research and education. For instance, the Stratus Lab project combines the advantages of the grid and of IaaS. It makes it possible to provision computing infrastructure on top of a shared collaborative computing ecosystem based on EMI (European Middleware Infrastructure). At the SaaS and PaaS level, the OpenMOLE project provides transparent access to the resources of high performance computing environments without locking the user's processing into any single execution environment.

Autonomic Cloud and Grid computing
Participants : * Cécile Germain * Pierre Collet

The evolution of software infrastructures and computing equipment towards pooled systems on the scale of the Internet has led to these infrastructures becoming complex systems. A crisis of complexity has emerged in this field, which may become a major obstacle to the prospect of ubiquitous computing: integrating, installing, configuring, optimising and maintaining these service infrastructures are all challenges facing industry, e-science and citizens. These challenges involve a growing human cost, which Grid and Cloud systems are already experiencing; the two should be seen as fully compatible modes of managing the same computing and storage resources. Autonomic Computing proposes a change of paradigm: systems capable of managing themselves using global strategic objectives defined by the designers and administrators. The biological connotation of the term is intentional: this ultimately involves introducing functionalities like those that allow living beings to function in harmony with their environment by means of a complex network of overwhelmingly decentralised mechanisms.

The challenges raised by Autonomic Computing are those of ubiquitous computing, which it tries to resolve in a self-adaptive way. In general, there are four: self-configuration, self-optimisation, self-recovery and self-protection. Self-configuration appears to be the least-explored challenge, and its modelling is still in its early stages. The most recent data indicate that the technologies associated with Clouds (*aaS) are not making significant progress on the question of integrating heterogeneous professional software components (industrial or scientific) to create new, reliable and robust applications. Self-optimisation covers all the classic and recent fields of exploitation: placement planning and data distribution as well as energy consumption management. Self-recovery corresponds to the general problem of distributed systems, namely identifying, diagnosing and processing localised failures; although fundamental limits exist in this field, there is enormous scope for progress in processing automation. Self-protection concerns failures and global assaults which create cascades of events in the system, and it therefore comes second, after self-recovery.

More recently, the challenge of autonomic programming emerged, which takes a new look at DDDAS (Dynamic Data Driven Application Systems) objectives in the light of autonomic computing concepts – seeing applications as symbiotic feedback control systems, particularly for simulations or real time applications with a strong multi-scale dimension.

A paragraph on self dynamic workload management must be added for the exploitation of multi-scale parallelism.

Integrating/Visualizing knowledge
Participants: *David Chavalarias *Camille Roth *Jean-Philippe Cointet *Carla Taramasco

Academic papers published on various scientific publication platforms (Medline, Thomson ISI, Arxiv, Microsoft Academic Search, etc.) or on other digital platforms will be exhaustively analysed to map past and current knowledge as well as to gather partial models and data for Complex Systems.

Scientific communities are driven by the complex intertwining of social and semantic dynamics. These dynamics will be modelled at every level. Reconstructing complex socio-semantic dynamics from textual traces first requires advanced linguistic skills to transform natural language content, as well as relevant metadata, into a realistic formalisation of the system. Common linguistic treatments such as part-of-speech tagging or stemming are required to abstract textual content, which is by nature qualitative, into a pertinent quantitative form. We also need to use more advanced NLP methods and text mining strategies to disambiguate concepts according to their context, to normalize corpora (identification of distinct authors who share the same name), and to analyse the way past work is cited according to the implications conveyed by the context of citation. Document structure analysis answers the question of which section a given work is cited in (experimental settings, introduction, discussion, etc.), and sentiment analysis answers the question of whether authors criticize the results of a cited article.

Micro-level dynamics can be summarized in a purely relational framework: scholars are acquainted with concepts, which can themselves be strongly associated with other concepts or with references, journals, etc. From this raw heterogeneous network, one can draw a dynamical multi-network featuring “proximity relationships” between nodes: this abstraction of the original textual data (plus metadata) will be called the epistemic network. From a more geometrical perspective, one can consider that each node (whatever its nature) can be positioned in the space of probability distributions according to the joint use in the corpus of this precise node with any other node pertaining to the epistemic system. One can then readily build on the information geometry framework and its operators to derive distances between nodes, or even to measure the displacement of nodes over time.
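As a rough illustration of this geometrical view, the sketch below positions each node by the distribution of the other nodes it co-occurs with, and compares two nodes with the Hellinger distance. The text does not say which distance the project uses, so the choice of Hellinger distance is an assumption; the toy corpus, node labels and function names are also hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations
from math import sqrt
from typing import Dict, Iterable, Set

Distribution = Dict[str, float]


def cooccurrence_distributions(docs: Iterable[Set[str]]) -> Dict[str, Distribution]:
    """Map each node to the normalised counts of nodes it co-occurs with."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for doc in docs:
        # Every ordered pair of distinct nodes in a document co-occurs once.
        for a, b in permutations(doc, 2):
            counts[a][b] += 1
    distributions: Dict[str, Distribution] = {}
    for node, counter in counts.items():
        total = sum(counter.values())
        distributions[node] = {other: k / total for other, k in counter.items()}
    return distributions


def hellinger(p: Distribution, q: Distribution) -> float:
    """Hellinger distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    squared = sum((sqrt(p.get(k, 0.0)) - sqrt(q.get(k, 0.0))) ** 2 for k in keys)
    return sqrt(squared / 2.0)


# Toy corpus: each document is the set of nodes (authors, concepts) it mentions.
corpus = [
    {"author:A", "concept:grid", "concept:cloud"},
    {"author:B", "concept:grid", "concept:cloud"},
    {"author:C", "concept:embryo"},
]
positions = cooccurrence_distributions(corpus)
d_ab = hellinger(positions["author:A"], positions["author:B"])
d_ac = hellinger(positions["author:A"], positions["author:C"])
```

In this toy corpus, authors A and B use exactly the same concepts and end up at distance 0, while A and C share nothing and end up at the maximal distance 1. Computing the same distributions on successive time slices of the corpus would give one way to measure the displacement of a node over time.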

The multi-level nature of the underlying socio-semantic system is addressed through the analysis of the structure of this epistemic network: characteristic macro-patterns give birth to individuated high-level structures, such as emerging groups of co-authors forming specialized communities, groups of jointly cited references which define a shared cognitive paradigm, and semantically close groups of concepts forming robust thematic fields, whatever the underlying conceptual branching process at the micro level. Their high-level dynamics define a “phylomemetic” branching process which captures flows (of concepts but also of persons) between communities and explains the attraction of a given field, thematic dynamics, etc. We will build on recent advances in network dynamics analysis (from-scratch or a posteriori dynamical clustering, stochastic actor modelling) to reconstruct the continuous dynamics of research fields at different levels. Emergent fields will be automatically extracted and contextualized within larger cognitive trends (analysis of dynamical graphs as well as citation/co-citation information will be used).

Reconstructed phylomemetic processes will give birth to the fully browsable and dynamical Integrated Knowledge Map, which will feature any type of entity, such as laboratories, journals, concepts, types of concepts (genes, processes, diseases, etc.), theories, methodologies, authors, etc. These maps are designed as multi-level and intermediary objects with a view to developing a shared collective consciousness of scientific dynamics: depending on the level at which the map is browsed, one can access a more or less synthetic description of the fields and gain a local or global vision of the social and cognitive dynamics within these fields. Tuning the level of detail of knowledge maps is a key feature for users, who may wish to access different levels of precision according to their scientific background. At every level, automatically detected fields will be associated with raw sources (mainly articles, or possibly Wikipedia articles, etc.), which will be made directly accessible from within the knowledge maps web interface specially designed for the CECS e-lab. We expect that these tools will help structure e-communities at work. This high-level characterization of the epistemic communities under study will be systematically commented on by field specialists and by science historians or sociologists (Science & Technology Studies). Indeed, we will call for a collaborative effort from the community to define the gold standard map as the description that most realistically summarizes the necessarily partial observations made by each expert. These maps propose an augmented phenomenology of the epistemic communities that will then be enhanced through interaction with users.

Transformation of ArXiv into MathML
Participants: * Michael Kohlhase

The task addresses the challenge of making explicit the meaning of formulae and statements in theoretical corpora, and of developing techniques for recovering salient aspects of the meaning of formulae and statements from presentation-oriented scientific papers and tutorial expositions of STEM knowledge. We will concentrate on recovering the operator trees of formulae and determining the logical structure of statements (theorems, definitions, and assumptions), so that the results can be used in compute engines, theorem provers, and STEM information retrieval.
 * Transformation of LaTeX formulae in the arXiv into strict content MathML.
 * Determination of the scope and nature (universally or existentially quantified) of variables in mathematical statements by shallow linguistic techniques. This information is essential for the computational nature of equations and the retrieval of applicable theorems.
 * Reconstruction of the logical structure of statements: usually statements consist of various declarations and restrictions on variables, together with assumptions that relativize the assertion of the statement.
 * Transformation of formulae and assertions into the input languages of the computational engines used in the Rhapsody DC project. Furthermore, a mathematical, formula-aware information retrieval engine will be built for arXiv, Wikipedia, and PlanetMath and integrated into CosyPedia. This knowledge will be used by the Autonomic Computing Ecosystem to feed the available partial models and constraints when trying to find the best model.
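To give an idea of the target of the first step, the sketch below serialises a small operator tree as strict content MathML, where every operator is a csymbol that refers to a content dictionary. It does not parse LaTeX: the operator tree for a + b·c is written by hand, and the tuple encoding and function name are assumptions made for this illustration, not the project's tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of an operator tree:
#   a leaf is a variable name (str) or an integer literal (int);
#   an inner node is a tuple (content_dictionary, symbol, arg1, arg2, ...).


def to_strict_cmml(tree) -> ET.Element:
    """Serialise an operator tree as a strict content MathML element."""
    if isinstance(tree, int):
        number = ET.Element("cn", {"type": "integer"})
        number.text = str(tree)
        return number
    if isinstance(tree, str):
        identifier = ET.Element("ci")
        identifier.text = tree
        return identifier
    cd, symbol, *args = tree
    application = ET.Element("apply")
    # In strict content MathML the operator is a csymbol that names its
    # content dictionary through the cd attribute.
    head = ET.SubElement(application, "csymbol", {"cd": cd})
    head.text = symbol
    for arg in args:
        application.append(to_strict_cmml(arg))
    return application


# Operator tree of a + b * c, using the arith1 content dictionary.
expression = ("arith1", "plus", "a", ("arith1", "times", "b", "c"))
markup = ET.tostring(to_strict_cmml(expression), encoding="unicode")
```

The hard part of the task, which this sketch leaves out entirely, is recovering such a tree from presentation-oriented LaTeX, where for example juxtaposition may mean multiplication or function application depending on context.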

Permanent Observation of the Web
Participants: *Eddie Soulier *Veronique Benzaken *Jorge Louçã

Theseus is a Python package developed by The Observatorium team, including several modules to deal with webpage retrieval and text processing. It includes Python scripts that, together with bash scripts, are used for collecting and processing web pages for the observatorium databases. Theseus is an open source library, free for research usage. This package makes it possible to create web robots that can read and rewrite web pages.
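The kind of processing such a web robot performs can be sketched with the Python standard library alone. The example below is not Theseus and does not use its API, which is not described here; it only shows, on an inline HTML string, how visible text and outgoing links might be extracted from a retrieved page. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class TextAndLinks(HTMLParser):
    """Collect visible text and hyperlink targets from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: List[str] = []
        self.links: List[str] = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script or style.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html: str) -> Tuple[str, List[str]]:
    """Return the visible text and the list of links found in the page."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


page = (
    "<html><head><style>p{}</style></head><body>"
    '<p>Complex <a href="http://example.org/a">systems</a></p>'
    "<script>var x = 1;</script></body></html>"
)
text, links = extract(page)
```

In a real crawler the page would come from an HTTP request, and the collected links would feed the queue of pages to visit next.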

Education Commons
Participants: * Danièle Bourcier * Pierre Collet * Michael Kohlhase * Mélanie Dulong de Rosnay * Carlos Gershenson

The Open Education Resource movement is supported by actors such as UNESCO and recommends using an open license to distribute educational content, so that educators may translate, adapt and remix educational material, sometimes including for commercial purposes in the case of private education or training. See examples here: http://wiki.creativecommons.org/OER

Recent declaration: http://www.unesco.org/new/en/communication-and-information/resources/news-and-in-focus-articles/all-news/news/unesco_world_oer_congress_releases_2012_paris_oer_declaration/

Technology has an impact on education because it makes it possible to offer personalized, tailored learning materials. How will schools and universities incorporate this adaptive opportunity into their rigid structures? What will be the role of Massive Open Online Courses relative to traditional courses? They cannot replace them, but they do complement them.

Distributed Experimental hardware Platforms as Ubiquitous computing
Participants: * Pierre Collet * Emmanuelle Leize * Fatima Oulebsir Boumghar * Souidi Zahira * Nadine Peyriéras

The "Remote sensing applied to forest" team brings to this e-laboratory a reflection on sustainable forest management for any region or country that needs a clear picture of its resource base. Effective management of forest resources, both public and private, requires reliable and timely information about their status and trends. Yet existing monitoring efforts are failing to meet increasingly complex and large-scale forest management needs. New technologies may be able to satisfy a nation’s forest information needs. An important development over the past quarter-century has been the deployment of Earth-observing satellites and rapid improvements in computing power and in algorithms to interpret space-based imagery. Remote sensing, along with GIS and direct field measurements, has shown the potential to facilitate the mapping, monitoring, and modeling of forest resources.

Now that these technologies have been available for a significant period of time, how have they been integrated into forest monitoring practice and, importantly, exploited by decision makers?

Together with these questions, we bring to the e-lab part of our contribution to:
 * 1) show the different applications of remote sensing in forestry,
 * 2) formally specify protocols from forest observation to spatial data analysis and interpretation,
 * 3) annotate and comment on the different software packages available for processing spatial data,
 * 4) define, obtain and validate observation data,
 * 5) contribute to the mapping and modeling of forest resources.

The “Multiscale Dynamics in Animal Morphogenesis” team brings to this e-lab the testbeds of the reconstruction of multiscale dynamics for investigating gastrulation processes in Deuterostomian species, including the teleostean Danio rerio and the sea urchin Paracentrotus lividus. Embryonic development leading from the egg cell to the whole organism can be fully described by the spatio-temporal deployment of the cell lineage tree, annotated with quantitative parameters for cell membrane and nucleus shape. This data can be automatically extracted from in toto 3D+time imaging of the developing organism and makes it possible to answer most of the questions of classical embryology, such as cell fate, the clonal history of presumptive organs, cell proliferation rates, and the contribution of cell division and its characteristic features to the shaping of tissues and organs. This is achieved by using the digital specimen corresponding to the validated phenomenological reconstruction of 3D+time image data sets. Exploring cohorts of individuals with different genetic and environmental conditions makes it possible to integrate the cellular and molecular levels of organisation. This approach is expected to serve as a basis for the reconstruction of multiscale dynamics in order to decipher emergent and immergent features at different levels of organization. Together with these biological questions, we bring to the project part of the biologists’ contribution to:
 * 1) formally specify protocols from the biological observation to data analysis and interpretation,
 * 2) annotate and comment data for active learning strategies,
 * 3) define, obtain and validate gold standard data,
 * 4) contribute to the supervision of the knowledge map elaboration,
 * 5) contribute to the design of the F-language and F-database.

Distributed Experimental software Platforms as Ubiquitous computing
Participants: * Veronique Benzaken * Julie Thompson * Jane Bromley * Renaud Vanhoutrève

The need for scalable software platforms in modern biology and translational science is indisputable, due to the very large heterogeneous datasets provided by public databases that are distributed globally. The management and analysis of all this data represent a major challenge, since they underlie the inferences and models that will subsequently be generated and validated experimentally. We propose to develop a new universal concept, called BIRD, for the development of a local integration system based on four fundamental concepts:
 * 1) a Hybrid flat file/relational database Architecture to permit the rapid management of large volumes of heterogeneous datasets;
 * 2) a Generic Data Model to allow the simultaneous organization and classification of local databases according to real world requirements;
 * 3) Auto-Configuration to divide and map one large database into several data model entities by declaring several configurations;
 * 4) a High Level Biological Query Language (BIRD-QL) to allow bioinformaticians and non-experts to search and extract data without requiring a deep understanding of database technologies and programming. This flexible approach could be applied to the construction of searchable databases having high-level scientific functionalities in accordance with the specific scientific context.

A prototype BIRD architecture has been used to develop a software platform for the analysis and prediction of the phenotypic effects of genetic variations, thus providing an integrated environment for translational studies by genetics researchers and clinicians. The platform currently includes 3 main modules:
 * SM2PH (http://decrypthon.igbmc.fr/sm2ph): database of genetic mutations in all proteins known to be involved in human genetic diseases. A wealth of interconnected information is provided for each disease-related protein, including data retrieved from biological databases using BIRD and data related to sequence, structure, function, evolution inferred from multiple alignments, three-dimensional structural models, and multidimensional (physicochemical, functional, structural, and evolutionary) characterizations of mutations. The annotated database is augmented with interactive analysis tools supporting in-depth study and interpretation of the molecular consequences of mutations, with the longer-term goal of elucidating the chain of events leading from a molecular defect to its pathology. The entire content of SM2PH is regularly updated thanks to the computational grid provided in the context of a French Muscular Dystrophies Association (AFM) program.
 * MSV3d (http://decrypthon.igbmc.fr/msv3d) and KD4v (http://decrypthon.igbmc.fr/kd4v): using the data provided in SM2PH, the relationships between genetic mutations, their effects on the structure and function of the associated protein and the clinical phenotypes can be investigated and predicted based on state of the art machine learning algorithms.
 * EvoluCode (http://lbgi.igbmc.fr/barcodes): Proteins in the cell do not function independently, but are involved in complex biological networks that are evolving over time. The study of the evolutionary histories of these networks gives clues to their cellular functions and dynamics. Unfortunately, the methodologies for representing and exploiting such complex evolutionary histories in large scale studies are currently limited. We propose to develop a formalism, called EvoluCode (Evolutionary barCode), to allow the integration of different evolutionary parameters in a unifying format and facilitate the multilevel analysis and visualization of complex evolutionary histories at the genome scale. The advantages of the approach have been demonstrated by constructing barcodes representing the evolution of the complete human proteome (~20,000 proteins). The formalism will open the way to the efficient application of data mining and knowledge extraction techniques in evolutionary studies of complex systems in biology.

Big data sharing
Participants:
 * Giuseppe Castagna
 * Véronique Benzaken

There is also the question of the format in which data will be made available and of the languages used to manipulate it. Although every user is likely to use a custom format, there is a clear need to translate these various formats into a common format so that the data can be fed to the tools developed by the project. For such a common format the natural choice is XML [see XML], since it is a widespread international standard supported by all major actors of the IT industry. The project must not impose any particular language for manipulating the data, as long as the data can be input and output in XML format. Languages that deal natively with XML, such as XSLT and XQuery (both W3C standards) or functional languages such as CDuce, HaXML, etc., have a clear advantage over languages in which XML can only be handled through libraries or the data-binding approach (e.g., Java, C#, Scala, Python); their use will therefore be promoted by the consortium. Among the former we will favour languages with static (XQuery, HaXML) and fine-grained (CDuce) typing over untyped ones (XSLT, XPath). Basic transformation bricks can then be composed using a higher-level language. The W3C currently promotes XProc as the standard for defining such compositions [see XProc].

If necessary, the workflow bricks associated with solvers can be composed using process composition languages such as XProc. Whenever possible, we will prefer static control of typing constraints, using languages where such control is very fine (CDuce) or less accurate (Scala, XQuery). Polymorphism and subtyping will be an advantage. If one prefers untyped languages (e.g., XSLT, the standard for processing XML documents), one can resort to dynamic control of the data format. This opens the possibility of configuring the workflow bricks differently for each new data set, or even of letting the computing ecosystem do this itself.
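As an illustration of dynamic control of the data format, here is a minimal Python sketch. It assumes a hypothetical common format in which a data set is a list of flat records; the element names (`dataset`, `record`) and the helper names are illustrative only and are not defined by the project.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, root_tag="dataset", row_tag="record"):
    """Serialize a list of flat dicts to an XML string.

    The layout (one <record> per dict, one child element per key) is an
    assumption made for this sketch. Keys must be valid XML element names.
    """
    root = ET.Element(root_tag)
    for rec in records:
        row = ET.SubElement(root, row_tag)
        for key, value in rec.items():
            ET.SubElement(row, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def check_format(xml_text, required_fields, row_tag="record"):
    """Dynamic format control: list the problems found (empty list if none)."""
    problems = []
    root = ET.fromstring(xml_text)
    for i, row in enumerate(root.findall(row_tag)):
        present = {child.tag for child in row}
        for field in required_fields:
            if field not in present:
                problems.append(f"{row_tag} {i}: missing <{field}>")
    return problems


if __name__ == "__main__":
    xml_text = records_to_xml([
        {"station": "A", "temperature": 12.5},
        {"station": "B"},  # incomplete record
    ])
    print(check_format(xml_text, ["station", "temperature"]))
```

A workflow brick could run such a check on its input before processing it, which is the run-time counterpart of the static typing offered by CDuce or XQuery.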

Data and software security
Participants:
 * Pierre Parrend
 * Djiby Sow
 * Pierre Collet
 * Véronique Legrand

The main stakes in the domain of data and software security are control over where data is located, protection against malicious external users, and protection against untrusted administrators at third-party sites.

A dual approach combining security engineering and system engineering is required. The expected security must be modelled together with the users, so as to identify the main perceived and actual risks and to validate the chosen properties at an abstract level.

System security audit and hardening are required at the implementation level, to ensure that the middleware enforces and completes the security model. Both low-hanging fruit (through automated audits) and architectural weaknesses (through manual audit of the system architecture) need to be addressed here.

New Open Data Creative Commons
Participants:
 * Danièle Bourcier
 * Melanie Dulong de Rosnay
 * Paul Bourgine

Like publications, data can also be marked with licensing metadata following a semantic web approach in RDF/XML:
 * http://creativecommons.org/ns
 * http://wiki.creativecommons.org/License_RDF
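As a minimal sketch of such licensing metadata, the Python snippet below builds an RDF/XML description that attaches a licence to a data set, using the `cc:Work` class and `cc:license` property of the Creative Commons namespace listed above. The data set URL is a placeholder, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC_NS = "http://creativecommons.org/ns#"


def license_metadata(work_url, license_url):
    """Return an RDF/XML string stating that work_url is under license_url."""
    # Register readable prefixes so the output uses rdf: and cc:.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("cc", CC_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    work = ET.SubElement(
        rdf, f"{{{CC_NS}}}Work", {f"{{{RDF_NS}}}about": work_url}
    )
    ET.SubElement(
        work, f"{{{CC_NS}}}license", {f"{{{RDF_NS}}}resource": license_url}
    )
    return ET.tostring(rdf, encoding="unicode")


if __name__ == "__main__":
    # Placeholder data set URL; the licence URL is the CC0 dedication.
    print(license_metadata(
        "http://example.org/dataset/42",
        "http://creativecommons.org/publicdomain/zero/1.0/",
    ))
```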

Open Access policies recommend distributing publications and data as soon as possible under a Creative Commons licence 1.0 (Attribution 1.0 Generic, CC BY 1.0). However, some may want to wait a few months before publishing their data. A mechanism could be designed to release the data automatically after an embargo period.
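One possible shape for such an embargo mechanism is sketched below in Python. It is only an assumption about how a release rule could be expressed (a deposit date plus a number of embargo days); the function names are hypothetical.

```python
from datetime import date, timedelta


def release_date(deposit_date, embargo_days):
    """Date on which an embargoed data set becomes public."""
    return deposit_date + timedelta(days=embargo_days)


def is_public(deposit_date, embargo_days, today):
    """True once the embargo period has elapsed."""
    return today >= release_date(deposit_date, embargo_days)


if __name__ == "__main__":
    deposited = date(2013, 1, 1)
    print(release_date(deposited, 180))
    print(is_public(deposited, 180, date(2013, 3, 1)))
```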

User Activity Capture and Quantified Self for Learning
Participants:
 * Eddie Soulier

A lifestream is a time-ordered stream of documents that functions as a diary of your electronic life; every document you create and every document other people send you is stored in your lifestream. Lifelogging is the process of tracking personal data generated by our own behavioral activities. While lifestreaming primarily tracks the content we create and discover, lifelogging tracks personal activity data such as exercising, sleeping, eating and, of course, creating or using knowledge and learning. The term lifelogging was coined by Gordon Bell, whose reason for tracking this information was to help optimize our behaviors by analyzing and learning from the data collected.

The Quantified Self movement goes beyond simply tracking raw data: it tries to draw correlations from the data and to find ways to improve our lives. Quantified Self is grounded in the Big Data paradigm, where data come from social bookmarking, daily photos, events, micro-blogging, geo-location, computer usage, document sharing, gaming, and many other sources. There are numerous lists of devices (gadgets & sensors), apps, web services and data aggregation services that are being created to allow us to lifelog learning activities.

We propose to create a site as the base for a community of users who will collaborate to share self-knowledge, tools and interests around the topic of self-tracking linked with complex knowledge and learning. The site will also be a real-time web platform for creating a lifestream. Finally, the platform will help in setting up the process of long-term digital preservation.

Using all Scientific Software
Participants:
 * Karim Chine

Many talk about cloud computing, some try it, yet only a few succeed, since cloud computing follows a new paradigm that needs to be learned and understood. With public cloud computing a new era for research and higher education begins. Scientists, educators and students can now work on advanced high-capacity technological infrastructures without having to build them or to comply with rigid and limiting access protocols. Thanks to the cloud's pay-per-use and virtual machine models, they rent the resources and the software they need for the time they want, get the keys to full ownership, and work and share with little limitation. In addition, the centralized nature of the cloud and the users' ubiquitous access to its capabilities should make it straightforward for every user to share any reusable artefacts with others. This is a new ecosystem for open science, open education and open innovation. What is missing is bridging software.

We propose such software to help data scientists, educators and students take advantage of this new ecosystem: R, Python, Scilab, Matlab, Mathematica, spreadsheets, etc. are made accessible as articulated, programmable and collaborative components within a virtual research and education environment (VRE). The result requires some adaptation in the way we think: teachers can easily prepare interactive learning environments and share them like documents in Google Docs; students can share their sessions to solve problems in collaboration. Costs may be hidden from the students by giving them temporary access to shared institution-owned resources or by using tokens that a teacher can generate from institutional cloud accounts. This includes online courses.

The scenarios described above can also involve researchers within a collaboration. Various domain-specific customizations of the collaborative VRE are available, and a framework for the rapid prototyping and delivery of Scientific Gateways is included. The Elastic-R project website: www.elastic-r.net

Using big data assimilation by models
Participants:
 * Matthieu Herrmann
 * Sylvie Thiria
 * Evelyne Lutton
 * Pierre Collet
 * Jonatan Gomez
 * Julie Thompson
 * Jane Bromley
 * Michèle Sebag

Natural Language Processing applied to legacy literature in the fields of biodiversity and agriculture, together with the various data sets associated with it.

Data assimilation (DA) is a methodology that applies to any problem in which observations (e.g. temperatures, atmospheric pressure, wave height, etc.) are used to correct imperfections in predictive numerical models (ocean model, etc.). Within the framework of dynamic numerical models, these observations may be data collected at the time of analysis or past data, and the observed data are used to correct the model’s predictions over time. From this perspective, the concept of DA is similar to that of Dynamic Data Driven Application Systems (DDDAS), as it consists of injecting data into an application during its execution in order to adapt it as closely as possible to reality and to take better account of the associated uncertainties. Often, however, the reality (climatic or oceanic phenomenon, etc.) is a complex system, and modelling it involves coupling several numerical models that represent different aspects (physical, chemical, biological, etc.) and exchange several types of information. DA enables these systems to integrate several sources of data in the course of their evolution so that they correct one another.
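The correction step can be illustrated with the simplest textbook case: a single scalar quantity, a model forecast and one observation, each with a known error variance. The Python sketch below uses the standard minimum-variance (Kalman-type) update; it is an illustration under these simplifying assumptions and not the scheme of any particular project mentioned here.

```python
def assimilate(forecast, forecast_var, observation, observation_var):
    """Combine a model forecast with an observation (scalar case).

    The gain weights the observation by the relative confidence in the
    forecast: a large forecast variance pulls the analysis towards the
    observation, a large observation variance keeps it near the forecast.
    """
    gain = forecast_var / (forecast_var + observation_var)
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


if __name__ == "__main__":
    # Model predicts 20.0 degrees, a sensor reports 22.0, equal confidence.
    print(assimilate(20.0, 4.0, 22.0, 4.0))
```

With equal variances the analysis falls halfway between forecast and observation, and its variance is lower than either input, which is the sense in which the data "correct" the model.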

For this type of paradigm to become a reality, MP research needs to concentrate on the link between formulations and solution algorithms. In optimisation, a single problem admits several different formulations, so the pair (formulation, solver) that best matches the user’s wishes must be found. Defining an abstract space containing these pairs, and the techniques for searching this space, both require the application of various fields of mathematics and computing, including logic, algebra, semantics and numerical analysis.

The YAO project (http://www.locean-ipsl.upmc.fr/~yao/publications.html) has defined an experimental prototype for facilitating the design, experimentation and implementation of 4D-Var (a variational data assimilation method). Memory management and the generation of massively parallel algorithms are now under investigation. The prototype will generate high-performance source code for each specific assimilation method by extracting and implementing parallel instructions and tasks, in order to exploit computer architectures efficiently. Significant advances have recently been achieved, driven by the widespread adoption of multi-core processors and massively parallel hardware accelerators (such as General-Purpose Graphics Processing Units, or GPGPUs). They mostly rely on an algebraic representation of programs, known as the “polyhedral model”, which appears to be suitable for the parallelization of the computational kernels involved in assimilation problems. This capability will allow an increased use of assimilation methods. YAO will be used to develop data assimilation capabilities for models for which such capabilities do not exist at present, and new application domains, such as hydrology, will be investigated. It is planned to make YAO available to interested scientific or industrial users.
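For reference, the cost function minimized by 4D-Var is usually written as follows in the textbook formulation. The notation is the standard one and is an assumption here, not necessarily the notation used by YAO: x_0 is the initial state being sought, x_b the background state with error covariance B, y_i the observations at time i with error covariance R_i, H_i the observation operator and M_{0→i} the model propagating the state from time 0 to time i.

```latex
J(x_0) = \frac{1}{2}\,(x_0 - x_b)^{\mathsf T} B^{-1} (x_0 - x_b)
       + \frac{1}{2} \sum_{i=0}^{N}
         \bigl(H_i(x_i) - y_i\bigr)^{\mathsf T} R_i^{-1} \bigl(H_i(x_i) - y_i\bigr),
\qquad x_i = M_{0 \to i}(x_0)
```

The first term penalizes departure from the background state and the second penalizes the misfit to the observations over the whole assimilation window. Minimizing J requires its gradient, which is typically obtained with the adjoint of the model.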

The objective of machine learning (ML) is to identify data-generating models on the basis of observations (and, in some cases, a priori knowledge). Progress has been made over the last twenty years in two main areas. Firstly, statistical learning (Vapnik 95) concerns the way the quality of the learned model depends probabilistically on the quality and quantity of the data available, considering essentially the case of independent and identically distributed data. Secondly, new paradigms (e.g. online learning, active learning, adversarial learning) explore the interaction between learning and the world that generated the data, which may be non-stationary, may respond to queries, or may be adversarial (Cesa-Bianchi and Lugosi 2006).

Data mining (Han and Kamber 2000) has the same objectives, but with an emphasis on the storage and manipulation of huge quantities of data. The key criterion is scaling up, which has led researchers to return to certain ML approaches because of their good properties, and to develop them in a different spirit. These approaches include deep neural networks (Hinton 2006; Bengio 2006) and echo state networks (Jaeger 2001). Apart from the (non-constructive) results concerning the universal representation properties of neural networks (notably as regards the modelling of dynamical systems), these approaches offer two different paths to overcome the optimisation difficulties underlying learning and data mining.

Until recently, most machine learning techniques were oriented towards the discovery of hidden patterns or regularities in databases. This was the realm of batch learning, where the learner could consider all of the data at leisure and produce a one-shot hypothesis, and then possibly be reset for the next learning task.

However, a whole set of new applications does not fall into this one-shot learning scenario, and calls for adaptive learners that can process data on the fly and take advantage of past knowledge. For instance, in telecommunications, operators must adapt to their customers’ (slow) changes of habits as well as to (fast) varying operating conditions. Electricity providers must detect changes in consumption as well as possible telltale signs of future failures in their equipment. Even when the data do not arrive in flows, their sheer size forces a learner to consider only part of them at any one time and thus to learn incrementally. The arrival of ubiquitous intelligent environments and the ever-growing number of applications that require modelling complex evolving situations clearly demand new adaptive capacities in machine learning. The RAPSODY equipment cannot escape the necessity of processing never-ending sequences of data on-line.

This is in particular the object of data stream mining and of on-line learning techniques. One foremost challenge is to provide learning systems with a well-tailored memory of the past. New techniques seek to adapt either the size of the memory or the weight of past data in order to ensure a proper trade-off between taking advantage of identified regularities and being ready to adapt to new trends.
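The idea of tuning the weight of past data can be illustrated with one of the simplest on-line estimators, a running mean with exponential forgetting. The Python sketch below is a generic illustration (the class name and default value are hypothetical), not a technique attributed to any project described here.

```python
class ForgettingMean:
    """On-line mean in which older samples are progressively down-weighted.

    forgetting = 1.0 keeps the full past (ordinary running mean);
    smaller values shorten the memory and react faster to new trends.
    """

    def __init__(self, forgetting=0.9):
        if not 0.0 < forgetting <= 1.0:
            raise ValueError("forgetting must be in (0, 1]")
        self.forgetting = forgetting
        self.weighted_sum = 0.0
        self.weight = 0.0

    def update(self, value):
        """Process one new sample and return the current estimate."""
        self.weighted_sum = self.forgetting * self.weighted_sum + value
        self.weight = self.forgetting * self.weight + 1.0
        return self.weighted_sum / self.weight


if __name__ == "__main__":
    slow, fast = ForgettingMean(1.0), ForgettingMean(0.5)
    for sample in [0.0, 0.0, 0.0, 10.0, 10.0]:
        print(slow.update(sample), fast.update(sample))
```

After the jump from 0 to 10, the estimator with the shorter memory tracks the new level much faster, at the price of being noisier on stationary data: this is the trade-off described above.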

Among the most interesting new tools is a new breed of artificial neural networks (ANN) known as reservoir computing techniques. They bring new life to recurrent neural networks thanks to an automatic adaptation of the memory of past data (see [Lukosevicius and Jaeger, "Reservoir Computing Approaches to Recurrent Neural Network Training", Computer Science Review, March 2009]).

Artificial evolution (AE) can be defined as a population-based optimisation approach with a wide range of applications (Goldberg 2002). In terms of H. Simon’s contrast between simplifying until we arrive at a well-formulated problem and preserving the specificities of the search space and the desired solution (bounded rationality), AE is clearly on the side of bounded rationality. The strength (and the weakness) of AE lies in the complete freedom it offers the modeller in modelling the problem. The freedoms relevant to the field of complex systems include:
 * 1) the capacity to treat non-static optimisation problems, where the objective function evolves over time;
 * 2) the co-evolution of several populations in situations of cooperation or competition (whereby we can respond to problems where we have little a priori knowledge);
 * 3) the developmental approach, which makes it possible to explore the space of solution generators rather than the space of solutions;
 * 4) multi-objective optimisation (Deb 2001), exploring the set of possible compromises between conflicting objectives (the Pareto front), and
 * 5) interactive optimisation, where a priori knowledge is taken into account extensively and implicitly (learning of the modeller’s preferences).

More generally, one of AE’s essential contributions concerns the distinction between genotype and phenotype, which makes it possible to search one space (genotype) for a solution that will be expressed in another space (phenotype), where the transformation of genotype into phenotype reflects the environment’s influence.
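To make the notion of Pareto front in item 4 concrete, here is a minimal Python sketch that extracts the non-dominated points from a set of candidate solutions. It assumes that every objective is to be minimized and uses a simple quadratic-time scan, whereas multi-objective evolutionary algorithms use more elaborate ranking schemes.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Each tuple is (cost, error) for one candidate solution.
    candidates = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
    print(pareto_front(candidates))
```

The points that remain are the possible compromises: improving one objective for any of them necessarily degrades the other.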

Gama (Agent Based Modelling)
Participants:
 * Patrick Taillandier
 * Nicolas Marilleau

Recent years have seen a clear increase in the use of agent-based modeling (ABM) in various scientific and application domains. This technique consists in modeling the studied system as a collection of interacting decision-making entities called agents. Each agent can individually assess its situation and make its own decisions. An agent-based model can exhibit complex behavioral patterns and provide relevant information about the dynamics of the real-world system it represents. Moreover, it can be used as a virtual laboratory to test different policies.
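The principle can be shown with a deliberately tiny model, written here in plain Python rather than in a dedicated ABM language. It is a toy threshold-adoption model on a ring, chosen only for illustration: each agent looks at its two neighbours and adopts a behaviour once enough of them have adopted it. All names and parameters are hypothetical.

```python
class Agent:
    """An agent that decides from its local situation only."""

    def __init__(self, adopted=False, threshold=1):
        self.adopted = adopted
        self.threshold = threshold

    def decide(self, neighbours):
        """Return the agent's next state given its neighbours."""
        if self.adopted:
            return True
        return sum(n.adopted for n in neighbours) >= self.threshold


def step(agents):
    """One synchronous update of all agents placed on a ring."""
    n = len(agents)
    decisions = [
        agent.decide([agents[(i - 1) % n], agents[(i + 1) % n]])
        for i, agent in enumerate(agents)
    ]
    for agent, decision in zip(agents, decisions):
        agent.adopted = decision


def simulate(n_agents, threshold, steps):
    """Seed one adopter and return the number of adopters over time."""
    agents = [Agent(adopted=(i == 0), threshold=threshold)
              for i in range(n_agents)]
    history = [sum(a.adopted for a in agents)]
    for _ in range(steps):
        step(agents)
        history.append(sum(a.adopted for a in agents))
    return history


if __name__ == "__main__":
    print(simulate(7, threshold=1, steps=4))  # adoption spreads
    print(simulate(7, threshold=2, steps=4))  # adoption stalls
```

Changing the threshold plays the role of testing a policy in the virtual laboratory: the same local rule produces either full diffusion or none.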

While the classical KISS approach gave birth to a plethora of small, mostly theoretical toy models in the early years of ABM, the current trend is to develop more realistic models that rely on large sets of data, in particular GIS data (Edmonds and Moss 2005). Unfortunately, for modelers who are not computer scientists, building complex, incremental, data-driven modular models is a difficult task.

The GAMA modeling and simulation platform (Grignard et al., 2013; Drogoul et al., 2013), developed since 2007 as an open-source project, aims at providing modelers with tools to develop and experiment with highly complex models through a well-thought-out integration of agent-based programming, GIS data management, multi-level representation and flexible visualization tools. GAMA provides a complete modeling language (GAma Modeling Language, GAML) and an integrated development environment that allows modelers to build models as quickly and easily as in NetLogo (Tisue and Wilensky, 2004) while going beyond what Repast (North et al., 2013) or Mason (Luke et al., 2005) offer in terms of simulated experiments. It is currently applied in several projects in environmental decision-support systems, urban design, water management, biological invasions, climate change adaptation and disaster mitigation.

Image analysis, pattern recognition, and computational vision
Participants:
 * Mathieu Bouyrie
 * Fatima Oulebsir Boumghar
 * Krzysztof Krawiec

The cost of image acquisition in virtually all domains and application areas is steadily falling. As a result, computer systems and data centers are being flooded by image data of all sorts and modalities: medical imaging, security monitoring and surveillance, remote sensing, etc. Handling this flood is a challenging task for at least two reasons. Firstly, such data are by definition bulky and can come in large quantities at a time. As an example, optical coherence tomography (OCT), a relatively new medical imaging technique widely adopted in ophthalmology, can produce imaging data at speeds up to 2 GB/second. Secondly, retrieving meaningful knowledge from these data is difficult, as it involves aggregating information scattered over many image elements (pixels or voxels) and discovering higher-level patterns in them, which often requires complicated, iterative algorithms. For instance, segmentation of brain structures in a single magnetic resonance image (MRI) can take a day of intense computation on a high-end PC.

Therefore, image data require technological advancement (spacious storage, high computational power) as well as methodological competence, ranging from computer science to various application domains (e.g., medicine). We envision the CS-DC initiative as an effective platform capable of providing both. In particular, in cooperation with partner research centers, we will investigate the possibility of implementing advanced image analysis and medical decision support algorithms for optical coherence tomography (OCT). OCT is a noninvasive in-vivo technique that produces three-dimensional high-resolution (down to 1-2 micrometers per voxel) images of retinal structures in real time. Within this activity, we plan to develop novel algorithms for, e.g., effective and robust image enhancement, artifact removal, segmentation of anatomical structures, blood vessel detection, and blood flow estimation. For some of these tasks, it will probably be desirable to engage non-conventional processing platforms, such as graphics processing units (GPU), because image processing algorithms (particularly low-level image processing) can usually be parallelized conveniently and effectively. The developments in this area can have a substantial impact on current practices in ophthalmology, in particular by helping OCT supplant older, invasive imaging techniques like fluorescein angiography. Apart from that, OCT is occasionally applied in other branches of medicine and biology, for instance to investigate the aftermath of strokes (neurology) or to study the movements of cytoplasm in isolated cells.

Discrete Augmented Phenomenology
Participants:
 * David Chavalarias
 * Camille Roth
 * Jean-Philippe Cointet
 * Carla Taramasco
 * Idriss Aberkane

The object of the Creage Project is to show contents (Web, databases, journal articles, etc.) in perspective in order to improve the knowledge flow.

Mapping contents onto space makes them easier to access and memorize. When Pannini mapped views of Modern Rome onto a virtual space, he already understood the power of spatializing contents to make more of them available to the mind at a glance.

Creage is a prototype software to spatialize contents and in particular web and scholarly contents, but beyond Pannini, in a dynamic way. We are developing a web 3D interface that allows users to tap into their spatial memory to access, remember and exchange contents: videos, articles, web pages, talks, courses, etc. As global knowledge is doubling every nine years there is an immediate need to increase the global knowledge flow: how can researchers access, mentally manipulate and remember more contents? Representing contents in a dynamic, multiscale 3D Natural User Interface, making knowledge exchanges social, spatial and putting them in perspective is what Creage is about.

(screen captures will follow soon)

Cosypedia as part of the Wikipedia consortium
Participants:
 * Yohan Dréo (consultant)

Both for worldwide scientific challenges and courses for research and education in the science and engineering of complex systems

As the main collaborative distant working tool
Both for research and education

Measuring the satisfaction of Cosypedia Users
Participants:
 * Guiou Kobayashi

Both for research and education users

Organization of Worldwide Contest
Participants:
 * Miguel Luengo

For research

Serious games for virtual training on multiscale stochastic dynamics
For education

Complex Systems Collaboratorium (CSColl)
Participants:
 * American Center for Strategic Studies CEE-IAEAL-USB http://www.cee.usb.ve/acss.html
 * Grupo de Investigacion en Sistemas Complejos: https://sites.google.com/a/usb.ve/gisc/home
 * Laboratorio de Evolución, USB: http://atta.labb.usb.ve/

Only with interdisciplinary knowledge will we be able to tackle complex problems. This knowledge is beyond the realm of a single specialist or a single institution. This poses a challenge for educational programs that aim to form a generation of researchers better prepared to tackle complex problems. Computational aids that allow several researchers in different parts of the world to discuss, support, interview, teach, tutor, advise and interact with students from different countries and disciplines will certainly be required to implement any improved advanced educational platform appropriate for complex systems research. CSColl aims to enable international communal tutoring of students by creating a virtual educational atmosphere analogous to that found in advanced research groups, where interactions among senior scientists, junior researchers, postdocs, PhD students and undergraduate students favor the transmission and creation of knowledge that fuels advanced scientific research.

The main objective is to allow for interdisciplinary research and advanced education in complex systems by expanding the pool of knowledge available to students or researchers in their academic environment to a much broader, international community of students of Complex Systems. CSColl is an e-laboratory in complex systems and participates in the e-laboratory on Computational Ecosystems in order to develop tools for its functioning. See the e-laboratory on Complex System Collaboratorium.

URL for the Website and/or Wiki of the e-laboratory

 * page at Wikiversity

Grid, Cloud, or other network utilities to be used

 * (insert the text here)

Data and/or Tools to be shared

 * Node of the National Network for advanced EPR Spectroscopies (TGE Renard CNRS, http://renard.univ-lille1.fr/).