Topics for new theses (tentative list)


Theses can be developed at either the Alameda or the Taguspark campus. All theses have the possibility of a research grant.

Interested? Drop me an email and we can talk about these (or other) topics.


 

LazyFork: Lazy page migration for even faster remote forks on serverless cloud functions based on CXL shared memory

In the context of serverless cloud computing, the recently proposed CXLfork interface [1] promises substantial speed-ups by using CXL-attached shared memory [3] for cluster-wide process cloning.

However, CXLfork still incurs the latency of page migrations from CXL memory to the local DRAM tier, which are implemented by the traditional Linux move_pages system call.

This thesis will explore the opportunity to optimize the page migration system call by employing a lazy approach when the pages being migrated are read-only. To achieve this, the thesis will borrow techniques from lazy page management that have been previously proposed in other contexts [2].
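To make the contrast between eager and lazy migration concrete, here is a toy model in Python. It is only a sketch under strong assumptions: it does not use the real move_pages system call, the names Page and LazyMigrator are hypothetical, and "copy on first access" is just one possible lazy policy, not necessarily the one the thesis will adopt.

```python
# Toy model only: contrasts eager and lazy migration of read-only pages.
# All names (Page, LazyMigrator, CXL, DRAM) are hypothetical illustrations.
from dataclasses import dataclass, field

CXL, DRAM = "cxl", "dram"


@dataclass
class Page:
    data: bytes
    tier: str = CXL          # every page starts in the CXL tier
    read_only: bool = True


@dataclass
class LazyMigrator:
    pages: dict                                  # page number -> Page
    pending: set = field(default_factory=set)    # deferred migrations
    copies: int = 0                              # page copies performed so far

    def _copy_to_dram(self, pfn):
        page = self.pages[pfn]
        if page.tier != DRAM:
            page.tier = DRAM
            self.copies += 1

    def migrate(self, pfns):
        """Request migration to DRAM; defer the copy for read-only pages."""
        for pfn in pfns:
            if self.pages[pfn].read_only:
                self.pending.add(pfn)        # lazy: copy on first access
            else:
                self._copy_to_dram(pfn)      # eager: copy immediately

    def read(self, pfn):
        """Access a page, completing any deferred migration first."""
        if pfn in self.pending:
            self.pending.discard(pfn)
            self._copy_to_dram(pfn)
        return self.pages[pfn].data


# Eight pages; even-numbered pages are read-only, odd-numbered are writable.
pages = {i: Page(data=bytes([i]), read_only=(i % 2 == 0)) for i in range(8)}
migrator = LazyMigrator(pages)
migrator.migrate(range(8))
copies_after_call = migrator.copies   # only the writable pages were copied
first_read = migrator.read(0)         # triggers the deferred copy of page 0
```

In this model the migration request returns after copying only the writable pages, and read-only pages that are never accessed are never copied at all.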

 

[1] https://dl.acm.org/doi/10.1145/3676641.3715988

[2] https://www.cs.yale.edu/homes/abhishek/kumar-asplos18.pdf

[3] https://dl.acm.org/doi/10.1145/3613424.3614256

 

A comprehensive experimental study of recent proposals for memory tiering in emerging CXL-based memory architectures

The recent emergence of new heterogeneous memory architectures (such as CXL-based memory systems) has sparked a diverse set of papers proposing techniques for data placement over such architectures.

These proposals are typically evaluated with different sets of workloads and on different systems. Moreover, most proposals are experimentally compared against the mainstream alternatives (such as Linux's AutoNUMA, TPP [1] or libnuma's default placement policy), but not against each other.

This incomplete and inconsistent cross-proposal evaluation landscape makes it hard to understand which proposals are actually more competitive under different scenarios.

The main goal of this thesis is to fill this gap. The thesis will start from the open-source implementations of the most relevant proposals published in top-tier scientific conferences and journals in recent years.

It will then perform a comprehensive experimental comparison of these alternative systems using a diverse set of workloads (spanning different domains, from LLMs to HPC) and hardware configurations, on a real system with real CXL memory devices.

This comprehensive experimental evaluation is expected to yield new insights into the actual advantages of the different options in the design space of this domain. These insights will guide a new design that combines techniques from different proposals.

[1] https://dl.acm.org/doi/10.1145/3582016.3582063


Introducing fast transactions to in-kernel eBPF-based storage services

The emergence of the safe kernel extension framework Extended Berkeley Packet Filter (eBPF) [1] in Linux has ignited interest in system-level capabilities. One of the most exciting uses of eBPF is to enable in-kernel distributed storage services [2,3,4], which are able to serve requests even before the execution of the standard network stack. This has been shown to provide substantial latency and scalability improvements when compared to traditional storage services. However, implementing a storage service on eBPF is subject to crucial restrictions.

Not surprisingly, most available proposals for in-kernel storage only support limited key-value interfaces, with no support for transactions. However, many applications require the rich semantics of transactional interfaces.

In this thesis, we will study whether existing in-kernel storage services can be extended to support transactional operations by relying on the support that latest-generation Intel Xeon server CPUs provide for hardware transactional memory (HTM).

More precisely, the thesis will take advantage of a previous thesis, which has extended eBPF to expose HTM primitives to eBPF programs [5], and explore the incorporation of such features into a state-of-the-art in-kernel storage service (e.g., [2], [3] or [4]).

[1] https://ebpf.io/what-is-ebpf/

[2] https://www.usenix.org/conference/nsdi21/presentation/ghigoff

[3] https://www.usenix.org/conference/osdi22/presentation/zhong

[4] https://www.usenix.org/conference/nsdi24/presentation/zhou-yang

[5] https://fenix.tecnico.ulisboa.pt/downloadFile/2815144904098320/95659-pedro-duarte-dissertacao.pdf

 

Revisiting persistent software transactional memory for the emerging CXL shared memory pools

The rise of CXL 3.0 [1] has given new hope to the paradigm of Distributed Shared Memory (DSM) systems. This thesis will focus on CXL 3.0-based systems in which the CXL fabric is composed of persistent memory. Such systems define a Partial Failure Resilient DSM (RDSM) [2].

To build programs that access persistent data structures on top of RDSM, programmers need effective programming abstractions. In the past decade, the research community has devoted substantial attention to the abstraction of persistent memory transactions, and has proposed many efficient software-based transactional memory (STM) implementations of this abstraction [3].

However, such implementations depend on a full-system failure model, which is substantially different from RDSM. This thesis will revisit the design space of the most relevant proposals for persistent STM [3] and study how it can be adapted to support the more challenging RDSM model.

[1] https://computeexpresslink.org/wp-content/uploads/2023/12/CXL_3.0-Webinar_FINAL.pdf

[2] https://dl.acm.org/doi/pdf/10.1145/3600006.3613135

[3] https://www.dpss.inesc-id.pt/~jpbarreto/data/uploads/bib/pm_survey_csur2021.pdf

 

Faster concurrent CXL-based checkpointing for machine learning in spot VMs

With: Rodrigo Rodrigues

Training large-scale machine learning (ML) models is so time-intensive that the probability of failures is high enough to make the cost of restarting failed jobs a considerable factor. Furthermore, recent proposals have shown that these jobs can run in a very cost-efficient manner on spot VMs in the Cloud [1]. However, this also increases the chance of failures and retries.

Recently, PCcheck [2] has proposed a concurrent checkpointing mechanism that allows regular checkpointing to persistent memory with reduced overhead.
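As a rough illustration of what "concurrent" checkpointing means, the sketch below snapshots the training state and hands it to a background thread, so that training continues while the checkpoint is being written. This is not PCcheck's algorithm: the function names are hypothetical, the "model" is a list of four floats, and ordinary files stand in for persistent or CXL-attached memory.

```python
# Minimal sketch of concurrent checkpointing; NOT PCcheck's actual design.
# Ordinary files stand in for the persistent checkpointing medium, and the
# names write_checkpoint and train are hypothetical.
import copy
import json
import os
import threading


def write_checkpoint(state, path):
    """Persist a snapshot, publishing it atomically under its final name."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())        # force the data to stable storage
    os.replace(tmp_path, path)      # atomic rename: no partial checkpoints


def train(steps, checkpoint_every, ckpt_dir):
    """Toy training loop that checkpoints in the background."""
    state = {"step": 0, "weights": [0.0] * 4}
    writers = []
    for step in range(1, steps + 1):
        # Stand-in for one training iteration.
        state["weights"] = [w + 0.1 for w in state["weights"]]
        state["step"] = step
        if step % checkpoint_every == 0:
            # Snapshot first, so later iterations cannot mutate what is
            # being written; several writers may be in flight at once.
            snapshot = copy.deepcopy(state)
            path = os.path.join(ckpt_dir, f"ckpt_{step}.json")
            writer = threading.Thread(target=write_checkpoint,
                                      args=(snapshot, path))
            writer.start()
            writers.append(writer)
    for writer in writers:
        writer.join()               # wait for outstanding checkpoints
    return state
```

Calling `train(10, 5, some_directory)` leaves one checkpoint file per checkpointed step in that directory. The cost that remains on the critical path is the in-memory snapshot, which is where the characteristics of the checkpointing medium start to matter.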

This thesis will start by studying how PCcheck performs when the checkpointing medium is based on the recent CXL interconnect [3]. Based on that preliminary evaluation, the thesis will identify bottlenecks and propose improved techniques to mitigate them.

Among others, the proposed techniques will explore two main directions: i) the performance idiosyncrasies of CXL devices; ii) the existence of an interface for graceful shutdown in most spot VM offerings.

[1] https://dl.acm.org/doi/10.1145/3689031.3717459

[2] https://dl.acm.org/doi/pdf/10.1145/3669940.3707255

[3] https://dl.acm.org/doi/10.1145/3613424.3614256

 

Persistent hardware transactional memory in latest-generation Intel Xeon CPUs and on CXL persistent memories

With: Paolo Romano

The new generations of high-end server processors by Intel (since Sapphire Rapids) support shared memory synchronization with hardware transactions (Intel TSX) with new advanced features that were not available in previous generations. Concretely, there are new instructions that allow the program running inside a hardware transaction to suspend (and later resume) read tracking in the transaction.

The main goal of this thesis is to start from our prior work, DUMBO [1], which supports persistent hardware transactions for the IBM POWER9 architecture, and redesign it for the more limited HTM interface of latest-generation Intel CPUs. The proposed solution should also be tailored to the performance trade-offs of the emerging CXL-based persistent memory devices.

[1] https://arxiv.org/abs/2410.16110

 

Bandwidth-aware page placement for CXL-based memory

The goal of this thesis is to design, implement and evaluate a practical tool for tiered page placement in Linux-based systems equipped with DRAM and CXL-based memory expansion devices. More precisely, we will start from an existing system for dynamic tiered page placement (either Memtis, TPP or HyPlacer) and will study improvements to enable a bandwidth-aware page distribution. For bandwidth-intensive applications, the goal is to distribute popular pages across both memory tiers (DRAM and CXL) to maximize bandwidth usage. To achieve this goal, the work will borrow ideas from previous research in our team (BWAP).
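A first-order way to see why spreading hot pages over both tiers can help is sketched below. It assumes a purely bandwidth-bound workload that streams uniformly over its hot pages, and it ignores latency and contention. Under those assumptions, both tiers finish at the same time when the DRAM share equals B_dram / (B_dram + B_cxl). The function names and the bandwidth figures are hypothetical, and this is not the algorithm used by BWAP, Memtis, TPP or HyPlacer.

```python
# First-order model only: split hot pages in proportion to tier bandwidth.
# Function names and bandwidth figures are hypothetical assumptions.

def bandwidth_proportional_split(hot_pages, dram_gbps, cxl_gbps):
    """Return (dram_pages, cxl_pages) sized in proportion to bandwidth."""
    if dram_gbps <= 0 or cxl_gbps <= 0:
        raise ValueError("bandwidths must be positive")
    dram_share = dram_gbps / (dram_gbps + cxl_gbps)
    n_dram = round(len(hot_pages) * dram_share)
    return hot_pages[:n_dram], hot_pages[n_dram:]


def streaming_time(total_gb, dram_fraction, dram_gbps, cxl_gbps):
    """Time to stream total_gb when both tiers are read in parallel."""
    dram_time = total_gb * dram_fraction / dram_gbps
    cxl_time = total_gb * (1.0 - dram_fraction) / cxl_gbps
    return max(dram_time, cxl_time)   # the slower tier sets the pace


# Hypothetical tiers: 100 GB/s DRAM and 25 GB/s CXL, ten hot pages.
dram_pages, cxl_pages = bandwidth_proportional_split(list(range(10)), 100, 25)

# Streaming 125 GB: all in DRAM versus the bandwidth-proportional split.
all_dram_time = streaming_time(125, 1.0, 100, 25)
split_time = streaming_time(125, 0.8, 100, 25)
```

With these made-up numbers, eight of the ten hot pages go to DRAM and two to CXL, and the split streams the data in about 1.0 s instead of 1.25 s. A real system would also have to account for the higher access latency of CXL memory, which this model leaves out.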

Memtis: https://dl.acm.org/doi/10.1145/3600006.3613167

TPP: https://dl.acm.org/doi/10.1145/3582016.3582063

HyPlacer: https://arxiv.org/pdf/2112.12685.pdf

BWAP: https://arxiv.org/abs/2003.03304