DataJoint is an open-source library for managing and sharing scientific data pipelines in Python and Matlab.

DataJoint lets users create and share computational data pipelines. A pipeline consists of a database together with the analysis code that executes the steps of data collection and analysis. For example, many neuroscience studies are organized around DataJoint pipelines that start with basic information about the experiment, then ingest acquired data, and then process, analyze, and visualize the results. The entire pipeline is diagrammed as a graph in which each node is a table in the database with a corresponding class in the programming language; together, the tables and classes define the data structure and computations.
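
The graph view of a pipeline can be illustrated with a small, library-independent sketch. It does not use the DataJoint API; the table names (`Experiment`, `Recording`, and so on) and the `execution_order` helper are hypothetical, chosen only to mirror the neuroscience example above. Each node stands for a table, each edge for a dependency, and a topological sort gives one valid order in which the steps could run.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key is a table (node) and each value is
# the set of upstream tables it depends on. Names are illustrative only.
pipeline = {
    "Experiment": set(),          # basic information about the experiment
    "Recording": {"Experiment"},  # ingested acquired data
    "Processing": {"Recording"},  # processed data
    "Analysis": {"Processing"},   # analysis results
    "Figure": {"Analysis"},       # visualization of results
}


def execution_order(graph):
    """Return the tables in an order where every table follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(pipeline)
print(order)
```

Because this hypothetical pipeline is a single chain, only one order is valid: upstream tables always come before the tables that depend on them.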

DataJoint's key features include:

  • access to shared data pipelines in a relational database (MySQL-compatible) from Python, Matlab, or both
  • data integrity and consistency founded on the relational data model and transactions
  • an intuitive data definition language for pipeline design
  • a diagramming notation to visualize data structure and dependencies
  • a serialization framework for storing large numerical arrays and other scientific data in a language-independent way
  • a flexible query language to retrieve precise cross-sections of data in a desired format
  • automated execution of computational jobs, with built-in job management for distributed computing
  • managed storage of large data objects outside the database
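
The integrity guarantees listed above come from the relational model itself, and a short sketch can show them in isolation. This is not DataJoint code: it uses Python's built-in `sqlite3` module as a stand-in for a MySQL-compatible server, and the `session` and `recording` tables and the `insert_recordings` helper are hypothetical. It shows a foreign key rejecting a row with no parent, and a transaction rolling back the whole batch when one row fails.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a relational server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical tables: every recording must belong to an existing session.
conn.execute(
    "CREATE TABLE session (session_id INTEGER PRIMARY KEY, subject TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE recording (
           session_id INTEGER NOT NULL REFERENCES session(session_id),
           recording_id INTEGER NOT NULL,
           PRIMARY KEY (session_id, recording_id))"""
)

with conn:  # commits on success, rolls back on error
    conn.execute("INSERT INTO session VALUES (1, 'mouse_01')")


def insert_recordings(connection, rows):
    """Insert all rows in one transaction; return False if any row violates integrity."""
    try:
        with connection:
            connection.executemany("INSERT INTO recording VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


accepted = insert_recordings(conn, [(1, 1), (1, 2)])  # parent session 1 exists
rejected = insert_recordings(conn, [(1, 3), (2, 1)])  # session 2 does not exist
count = conn.execute("SELECT COUNT(*) FROM recording").fetchone()[0]
print(accepted, rejected, count)
```

In the second call the row `(1, 3)` is valid on its own, but it is rolled back together with the invalid row, so the table never holds a partially inserted batch.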

Project Author(s)

Dimitri Yatsenko; Edgar Walker; Fabian Sinz; Christopher Turner; Raphael Guzman

This post was automatically generated by Dimitri Yatsenko