Proposition Stage/PhD/Master thesis

Compilation of a DSL based on transducers to SIMD optimized programs

Efficient Data Processing

Streaming data processing is a crucial approach that focuses on traversing data to extract pertinent information. Applications ranges from network packet manipulation to analysing DNA. Modern data-processing tools heavily depend on efficient implementations that harness hardware acceleration to achieve high performance. This acceleration can sometimes be achieved through automatic compilation, but frequently demands expert developers to craft optimizations by hand.

One critical facet of this optimization process involves SIMD optimization, where data is packed into chunks and processed with minimal branching in the code, often using bitvector operations. These optimizations are at the core of numerous well-known software applications, such as regular expression matching in tools like ripgrep, JSON parsing in libraries like SimdJSON, and even fundamental operations like string encoding and decoding (unicode parsing). Developing these optimizations requires a broad skill set and is a testament to the expertise of programmers worldwide.

Exploring a Restricted Programming Language in this Internship

During this internship, we will explore the design and implementation of a specialized programming language for stream processing and its compilation to efficient SIMD code. The technics will take inspiration of real software design (such as rsonpath) and will be based on abstract automata theory and logic approach. Initially we will focus on a limited expressivity class, named LTL, whose theoretical properties are well understood.

Following this initial exploration, depending on the interest of the student, we will focus either on: - Extending our design to other language constructs that admit efficient vectorised implementation - Formally study the link between our language and finite state transducers and their expressivity - Output real-world efficient code in existing instruction sets.

Tasks

Location

This internship will be co-advise by Gabriel Radanne (INRIA Lyon) and Charles Paperman (Université de Lille) and can be located either in Lille or in Lyon or in both depending on the choice of the student.

Candidate profile

The candidate should ideally be familiar with formal approaches in programming language design, notably type systems, semantics, and logic. From the practical point of view, a basic experience in software programming and usage of collaborative tools such that git. This internship strongly relies on the fact that practical implementation should have strong theoretical foundations and that further refinements of the theory should get inspiration from the practical side. We expect the candidate to agree with this philosophy

Bibliography

  1. Parsing Gigabytes of JSON per Second
  2. Supporting Descendants in SIMD-Accelerated JSONPath
  3. Validating UTF-8 in less than one instruction per byte
  4. Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs
  5. Vectorial languages and linear temporal logic.
  6. An algebraic approach to vectorial programs

Compiled the: mer. 08 janv. 2025 11:51:56 CET