The Miljø DATABASE project consists of designing a database structure that stores and manages, within CBDA's Hadoop computational infrastructure, all kinds of data provided by Uni Research Miljø. In addition, CBDA can implement Machine Learning algorithms to perform further analyses, and can provide customized visualization results and tools.
A well-organized data structure will help the researchers at Miljø find the information they need and keep track of all the available data. By applying different Machine Learning techniques, CBDA can then provide new insights into the research, opening up further collaborations with Miljø and possibilities for joint grant applications. The same holds for modern visualization tools (e.g. libraries such as plotly, matplotlib and bokeh), which can not only give a clear picture of the meaning and semantics of the data but also reveal aspects and topologies that may have escaped prior analyses.
The project roadmap is as follows:
- Acquire the available data from Miljø. The data formats vary widely, from time series and tables of measurement parameters to multimedia content such as pictures and videos.
- Once the data is obtained, it has to be ingested into the Hadoop ecosystem and standardized as much as possible, using a master list of parameters and ontologies. This makes cross-field analysis possible (e.g. combining biological data with climate data), hopefully with much greater power of interpretation and understanding. The format of the database has yet to be discussed; the raw storage in HDFS could, for example, take the form of ORC tables, OpenTSDB time series, or a combination of both. This initial ingestion and preprocessing will be a core task and will most likely require ad hoc scripts and workflows customized for each input source and format.
- Access to the database must then be provided. This may happen in different ways, including a web interface with a login page and personalized features for the users. Notebooks such as Zeppelin or Jupyter can also be employed as an assisted means of accessing, modifying and analysing the data.
- Requested operations and analyses will be performed, with any relevant plots, tables, files etc. as output. At this stage, Machine Learning will be deployed to combine the different sources and hopefully extract new meaningful variables/features that may be hidden in the data. Visualizations, whether on request or not, will also be of great value here.
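The standardization step of the roadmap, in which source-specific field names are mapped onto a master list of parameters, could look roughly like the sketch below. The parameter names, units and aliases are invented for illustration and do not come from Miljø; the real master list still has to be agreed on. Fields that cannot be mapped are kept aside rather than dropped, so they can be reviewed and added to the master list later.

```python
# Hypothetical master list: canonical parameter -> unit and known source aliases.
# All names and aliases here are illustrative assumptions, not Miljø's real list.
MASTER_PARAMETERS = {
    "water_temperature": {"unit": "degC", "aliases": {"temp", "water temp", "t_water"}},
    "salinity": {"unit": "psu", "aliases": {"sal"}},
}


def build_alias_index(master):
    """Return a lookup from every lower-cased alias (and canonical name) to the canonical name."""
    index = {}
    for canonical, spec in master.items():
        index[canonical] = canonical
        for alias in spec["aliases"]:
            index[alias.strip().lower()] = canonical
    return index


def standardize_record(record, alias_index):
    """Rename the keys of one raw record to canonical parameter names.

    Returns (standardized, unmapped): fields with a known alias are renamed,
    all other fields are returned separately for manual review.
    """
    standardized, unmapped = {}, {}
    for key, value in record.items():
        canonical = alias_index.get(key.strip().lower())
        if canonical is None:
            unmapped[key] = value
        else:
            standardized[canonical] = value
    return standardized, unmapped
```

In practice each input source would probably need its own alias set and unit conversion, which is where the ad hoc ingestion scripts mentioned in the roadmap come in.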
Practical example: the Fish PIT-TAG sub-project
One of the first collaborations will concern the data from a fish-tagging and antenna-tracking system deployed by Miljø at stations in Arna and Vosso, both at sea and in rivers. Fish in controlled pits are tagged and released into the wild. A system of antennas later records, over long time spans (months and eventually years), the signals of tagged individuals passing nearby. This should make it possible to follow fish life cycles and migration timescales (from ocean to rivers and vice versa) and to trace a picture of the movements of the different species. CBDA will ingest and process data from both the tagging records and the antenna pings: the two kinds have to be automatically compared and matched to return the IDs of the individuals that have shown up again. This must of course keep track of all relevant metadata, and the results must be displayable grouped by any of the requested parameters. Any plots, tables and results of further analyses can be discussed with Miljø and delivered. The reference researchers at Miljø will be Bjørn Barlaup and Shad Mahlum.
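As a rough illustration of the matching step, the sketch below joins antenna pings to tagging records on a shared tag ID and then groups the matches by an arbitrary field. It is only a minimal in-memory sketch: the field names (`tag_id`, `species`, `released_at`, `station`, `detected_at`) and the sample rows are assumptions made up for this example, not the actual layout of the Miljø data, and a production version would run inside the Hadoop ecosystem rather than on Python lists.

```python
from collections import defaultdict
from datetime import datetime


def match_detections(tagging_records, antenna_pings):
    """Match each antenna ping to the tagging record with the same tag ID.

    Returns (matches, unknown): matches carry the tagging metadata plus the
    detection details; unknown holds pings whose tag ID was never registered.
    """
    tagged = {rec["tag_id"]: rec for rec in tagging_records}
    matches, unknown = [], []
    for ping in antenna_pings:
        rec = tagged.get(ping["tag_id"])
        if rec is None:
            unknown.append(ping)
            continue
        matches.append({
            **rec,
            "station": ping["station"],
            "detected_at": ping["detected_at"],
            "days_since_release": (ping["detected_at"] - rec["released_at"]).days,
        })
    return matches, unknown


def group_by(rows, key):
    """Group a list of dict rows by the value of one field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# Made-up sample rows, for illustration only.
SAMPLE_TAGGING = [
    {"tag_id": "A001", "species": "salmon", "released_at": datetime(2017, 5, 1)},
    {"tag_id": "A002", "species": "trout", "released_at": datetime(2017, 5, 1)},
]
SAMPLE_PINGS = [
    {"tag_id": "A001", "station": "Vosso", "detected_at": datetime(2017, 6, 15)},
    {"tag_id": "A002", "station": "Arna", "detected_at": datetime(2017, 7, 1)},
    {"tag_id": "A001", "station": "Arna", "detected_at": datetime(2017, 8, 1)},
    {"tag_id": "ZZZZ", "station": "Arna", "detected_at": datetime(2017, 8, 2)},
]

matches, unknown = match_detections(SAMPLE_TAGGING, SAMPLE_PINGS)
by_species = group_by(matches, "species")
by_station = group_by(matches, "station")
```

Keeping unmatched pings in a separate list, instead of discarding them, makes it easier to spot misread tag IDs or fish tagged outside the registered batches.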