Apr 27 — 2021

Quantum Physics II — Data collection, modelling, and analysis at CMS CERN

HOW DATA ANALYSIS AT CERN CAN HELP DETECT DARK MATTER
A comprehensive guide to the CERN workflow in new Physics discoveries.


In this second part, following the previous introductory article, we'll tackle a more in-depth description of data collection, object modelling, and data analysis at CMS. The general workflow behind these experiments is complex, but I'll give a brief description of each part so you can get a general idea of the whole process.

Analysis of a collision and important concepts

We've been talking about particles and detectors, but an important question remains unanswered: what exactly is colliding? The answer is protons, or more precisely, bunches of protons. Accelerated through the LHC and steered around the ring by powerful magnets, these bunches cross one another inside the detectors, and collisions happen. The proton beams carry essentially no transverse momentum, since they only move along the ring; this is exploited on purpose: in the transverse plane, conservation of momentum requires the total transverse momentum of the collision products to be null, so an imbalance signals that an invisible particle has been produced. A cross-section of the detector, with its components, can be seen below (here). Examples of such invisible particles are neutrinos, but also possible dark matter or other theoretical particles. So, what's the next step?
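To make the momentum-imbalance idea concrete, here's a minimal sketch in Python; the particle list and momenta are made up for the example. Summing the transverse momenta of all visible decay products and flipping the sign gives the missing transverse momentum, the tell-tale of invisible particles:

```python
import math

def missing_transverse_momentum(particles):
    """Sum the transverse momentum (px, py) of all visible particles;
    the negative of that sum is the missing transverse momentum vector."""
    sum_px = sum(p["px"] for p in particles)
    sum_py = sum(p["py"] for p in particles)
    return -sum_px, -sum_py

# Toy event: visible decay products with transverse momenta in GeV
visible = [
    {"px": 40.0, "py": 10.0},   # e.g. a jet
    {"px": -25.0, "py": 5.0},   # e.g. a muon
    {"px": -5.0, "py": -35.0},  # e.g. another jet
]

met_x, met_y = missing_transverse_momentum(visible)
met = math.hypot(met_x, met_y)  # magnitude of the imbalance
print(f"MET = {met:.1f} GeV")   # a large value hints at invisible particles
```

If the event were fully visible and perfectly measured, this sum would come out to zero.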

At the interaction vertex, these particles collide and the decay products flow through the different parts of the detector, which is built so that measurements are taken with high precision and allow the (mostly) unequivocal identification of particles. However, not everything is as straightforward as it seems, and several effects need to be taken into account; I'll briefly explain them below.

One of the most important magnitudes in colliders is luminosity, interpreted as the number of collisions per unit of time and surface. By 2018, when Run 2 ended, the luminosity was such that a bunch crossing occurred every 25 ns, and almost 150 inverse femtobarns (fb⁻¹) of cumulative data (each fb⁻¹ corresponds to roughly 100 trillion proton-proton collisions!) had been analysed.
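As a quick illustration of what integrated luminosity buys you: the expected number of events for a process is just its cross-section times the integrated luminosity, N = σ × L. A toy calculation with a hypothetical 50 pb process over a Run 2-sized dataset:

```python
# Expected event count: N = sigma (cross-section) x L_int (integrated luminosity).
# Units must match: 1 fb^-1 = 1000 pb^-1. The cross-section is illustrative.
sigma_pb = 50.0   # hypothetical process cross-section, in picobarns
lumi_fb = 150.0   # roughly the Run 2 integrated luminosity, in fb^-1

n_expected = sigma_pb * (lumi_fb * 1000.0)  # convert fb^-1 to pb^-1
print(f"Expected events: {n_expected:.2e}")  # 7.50e+06
```

The same arithmetic run with a tiny hypothetical dark-matter cross-section shows why so much luminosity is needed before anything can be said at all.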

This raises a significant problem: more collisions are obviously needed to assess the validity of models, but if there are too many, the detector won't be able to keep up with them; this is further complicated by the fact that the interactions happen among proton bunches. The concept is known as pile-up, and it refers to the average number of simultaneous collisions that occur each time the detector records the decay products of an interaction. With current data, this value is around 20 collisions, but CERN is allocating resources to an upgrade called the High-Luminosity LHC, which will improve the integrated luminosity and, in turn, increase the pile-up to almost 200. This is unfeasible with current hardware and software, so the project will need serious backing and development; the benefits, however, far outweigh the difficulties.
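The pile-up figure itself comes from simple arithmetic: the mean number of collisions per bunch crossing is the instantaneous luminosity times the inelastic cross-section, divided by the crossing frequency. A back-of-the-envelope sketch with round, illustrative numbers:

```python
# Mean pile-up per bunch crossing: mu = L_inst * sigma_inel / f_crossing.
# Values below are round numbers for illustration, not official figures.
L_inst = 1.0e34        # instantaneous luminosity, cm^-2 s^-1
sigma_inel = 80e-27    # inelastic pp cross-section, ~80 mb expressed in cm^2
f_crossing = 40.0e6    # bunch-crossing frequency, Hz (one crossing every 25 ns)

mu = L_inst * sigma_inel / f_crossing
print(f"Mean pile-up: {mu:.0f} collisions per crossing")
```

Plugging in a High-Luminosity LHC luminosity roughly ten times larger immediately shows where the pile-up of ~200 comes from.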

Okay, now that we know a lot of collisions occur at almost the same time, how is CMS able to discern the decay products of the interaction it wants? Not only that, but the amount of data produced is orders of magnitude beyond what current hardware and software can process and store. This is the core of CERN data gathering, and it's taken care of by several software and hardware solutions commonly referred to as the trigger. It's a really interesting part of the process, but too extensive and technical to discuss in full here, so I'll briefly explain the methodology behind it and leave some documentation here, here, and here in case you're interested. I'll also leave a diagram below that summarises this information.

The first phase filters out most of the data using dedicated detector hardware to select events within a really short time (a few millionths of a second!), relying specifically on information from the muon chambers and calorimeters; it's known as the level-1 trigger, or L1. This phase cuts the event rate from 40 MHz down to just 100 kHz, offering the first step towards useful data. Next, a farm of commercial processors running in parallel takes that data and further refines it using the full detector information, operating online in real time; this is called the high-level trigger, or HLT. The surviving data is then backed up and sent to the associated research institutions for analysis. Even at this last stage, the recorded data rate is still around the 1 GB/s mark, showing just how necessary the trigger and further resources are.
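The scale of the problem is easy to appreciate with some back-of-the-envelope arithmetic; the event size and HLT output rate below are rough assumptions for illustration, not official figures:

```python
# Back-of-the-envelope trigger budget (round numbers, assumptions for illustration).
event_size_mb = 1.0        # assumed raw event size, in megabytes
collision_rate = 40.0e6    # bunch crossings per second (40 MHz)
l1_output = 100.0e3        # L1 trigger output rate (100 kHz)
hlt_output = 1.0e3         # assumed HLT output rate (~1 kHz)

raw_rate = collision_rate * event_size_mb / 1e6   # TB/s if nothing were filtered
stored_rate = hlt_output * event_size_mb / 1e3    # GB/s actually written out
print(f"Unfiltered: {raw_rate:.0f} TB/s -> stored: {stored_rate:.0f} GB/s")
```

Tens of terabytes per second of raw collision data are whittled down to roughly the 1 GB/s figure quoted above, which is why the trigger cascade is unavoidable.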

CMS Level-1 trigger overview.
Schematic view of the CMS Data Acquisition System

Collision object identification

Now that collision data has been recorded, the next step is to identify the particles based on their traces through the detector. This will allow scientists to reconstruct the particles that have appeared as a result of the collision, using the data collected at the different parts of the detector. These reconstructions of particles at CMS employ an algorithm called Particle Flow; the algorithm itself is really complex and takes into account many measurements and variables to discriminate between particles and correctly tag them, so please check the documentation provided if you’re interested.

With this algorithm, photons and leptons (especially muons) are easily tagged, but more complex objects such as hadron jets are harder to identify. The difficulty of this jet tagging is related to the nature of the strong force and to a concept called sea quarks, which we'll discuss briefly next.

In a collision between two protons, or more accurately between their quarks, the immediate thought is that only the constituent quarks can interact, so only up and down quarks would take part. However, measuring the mass of the proton and of these quarks shows that most of the proton's mass doesn't come from its constituents but from internal binding energy. If the collider energy is high enough, this excess energy can produce other quarks, such as the bottom and top quarks, both several times heavier than the u and d quarks, and these may be the ones that interact at the collision point; an example of such a collision can be seen in the image below.

Schematic interaction of two colliding protons and their partons.

An important aspect of jet tagging is that, usually, the quark that initiated the jet is unknown, since the hadronisation process is so chaotic; however, a certain type of jet can be identified. Jets coming from the decay of a b quark have a characteristic secondary vertex, displaced far enough from the primary one to be measured (the reason is that b hadrons live longer than most other hadrons); for this reason it's treated as a separate, important signature and receives the name b-tagging. A specific algorithm, called CSV (Combined Secondary Vertex), was created to detect it.
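Conceptually, applying a b-tagger downstream is simple: each jet carries a discriminator score, and jets above a chosen working-point threshold are tagged. A toy sketch (scores and threshold are invented, not the real CSV calibration):

```python
def b_tag(jet, threshold=0.8):
    """Tag a jet as coming from a b quark if its discriminator score
    exceeds a chosen working-point threshold. Scores and threshold
    here are illustrative, not the real CSV calibration."""
    return jet["csv_score"] > threshold

# Toy jets: transverse momentum in GeV plus a made-up discriminator score
jets = [
    {"pt": 120.0, "csv_score": 0.95},  # likely a b jet
    {"pt": 60.0,  "csv_score": 0.30},  # likely a light-flavour jet
]

b_jets = [j for j in jets if b_tag(j)]
print(f"{len(b_jets)} b-tagged jet(s)")
```

Tightening or loosening the threshold trades tagging efficiency against the rate of light-flavour jets mistakenly tagged, which is exactly what the published working points encode.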


Data simulation and model checking

Once meaningful data has been retrieved and objects reconstructed, it's time to check this data against the known predictions of the SM. CERN takes care of this comparison by employing simulated Monte Carlo samples. These simulations include all data related to the processes' cross-sections (a measure of how likely a given process is to occur in a collision), decays, and detector components (to the point of knowing the location of each pixel!) so that the uncertainty of these controllable factors is minimised and meaningful conclusions can be drawn; if we want to measure a cross-section for dark matter production, which may be very low, these uncertainties could make the difference between a discovery and a mere statistical fluctuation.

The algorithms simulate particles moving through the detector, interacting with each other, including decay channels to high orders in perturbation theory, and in general being very precise about the location and efficiency of each and every part of the detector. The main generators used are Powheg and aMC@NLO, both interfaced with Pythia for parton showering and hadronisation. Afterwards, the software Geant4 simulates the particle interactions with the CMS detector itself. These tools provide SM-accurate predictions for all the different backgrounds needed in the analysis, which we'll explain next.
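In practice, a simulated sample rarely contains exactly as many events as the data luminosity implies, so each Monte Carlo event is given a weight σ × L / N_generated so the sample's total matches what the data should contain. A minimal sketch with illustrative numbers:

```python
# Normalising a Monte Carlo sample to data: each simulated event gets a
# weight so that the sample's total matches sigma x L_int.
def mc_event_weight(sigma_pb, lumi_pb_inv, n_generated):
    """Per-event weight scaling a generated sample to the data luminosity."""
    return sigma_pb * lumi_pb_inv / n_generated

# Illustrative numbers: a 50 pb process, 150 fb^-1 of data (= 150,000 pb^-1),
# and a sample of 10 million generated events.
w = mc_event_weight(sigma_pb=50.0, lumi_pb_inv=150_000.0, n_generated=10_000_000)
expected_yield = w * 10_000_000   # recovers sigma x L by construction
print(f"weight per event = {w:.3f}, expected yield = {expected_yield:.0f}")
```

Histograms of the weighted simulation can then be overlaid directly on data, which is exactly the kind of comparison described in the next paragraphs.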

Now that we have real collision data and data simulations, the next step is to define the process we want to study, like a certain decay chain producing dark matter. To check whether dark matter production is possible in this model, the investigator must include in their analysis all the possible backgrounds; in this context, a background is a process that leaves the same traces in the detector as the main process we want to study.

It's mandatory at this stage to apply blinding to the data, meaning that real data shouldn't be looked at until the end of the study; this prevents the investigator from being biased.

Finally, the goal of these discovery projects can be summarised in one sentence: after including the simulated data, the investigator selects observables like the number of jets, the number of b-tagged jets, missing transverse energy, etc., or defines new ones that could help discriminate the signal (the studied process) from the backgrounds; this means, for example, selecting intervals of these variables where dark matter processes are abundant while background processes are not.
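A signal-region selection like the one just described boils down to a set of cuts on per-event observables. A toy example (variables and thresholds are invented for illustration):

```python
# A toy signal-region selection: keep events with many jets, at least one
# b-tagged jet, and large missing transverse energy (MET).
def passes_signal_region(event):
    return (event["n_jets"] >= 4
            and event["n_btags"] >= 1
            and event["met"] > 200.0)   # MET threshold in GeV

events = [
    {"n_jets": 5, "n_btags": 2, "met": 320.0},  # passes all cuts
    {"n_jets": 3, "n_btags": 1, "met": 250.0},  # fails: too few jets
    {"n_jets": 6, "n_btags": 0, "met": 400.0},  # fails: no b-tagged jet
]

selected = [e for e in events if passes_signal_region(e)]
print(f"{len(selected)} event(s) in the signal region")
```

The art of the analysis is choosing cuts where the simulated signal survives while the simulated backgrounds are suppressed as much as possible.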

Afterwards, the investigator includes the real data and checks if the results are in agreement with the SM. The way this is done is by a hypothesis test of:

  • H0: SM physics
  • H1: BSM physics

This involves some advanced statistical analysis, like constructing a profile likelihood ratio to obtain confidence intervals (CI) on the cross-sections of these processes. This is very technical, so I'll leave some documentation here that covers the entire process, but the basic idea behind it is that we compare the obtained results with what is expected and:

  • If they're similar, then the investigator has refined the limits on the cross-section of the model for future reference, and further studies of that model are made easier.
  • If they aren't, then they might have discovered something new! In particle physics, evidence of a deviation is claimed when the difference between theoretical and experimental results exceeds the 3𝝈 level, but it isn't classified as a discovery until the 5𝝈 threshold; 𝝈 in this context refers to standard deviations away from the expected background, and this article gives a really good explanation of the concept. This shows that, in particle physics, we demand extreme certainty of the results obtained.
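The 𝝈 language maps directly onto tail probabilities of a Gaussian, and a rough discovery significance can be estimated with the standard asymptotic formula. A small sketch (the event counts are invented):

```python
import math

def p_value(z):
    """One-sided tail probability of a standard normal at z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def significance(s, b):
    """Asymptotic discovery significance for s signal events over
    b expected background events (the usual Asimov formula)."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

print(f"3 sigma -> p = {p_value(3):.2e}")   # the 'evidence' threshold
print(f"5 sigma -> p = {p_value(5):.2e}")   # the 'discovery' threshold
print(f"Z(s=30, b=100) = {significance(30, 100):.2f} sigma")
```

The 5𝝈 threshold corresponds to a chance of well under one in a million that the background alone produced the observed excess, which is why discoveries in particle physics are so trusted.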

As an example, the most recent case of this 5𝝈 threshold was the discovery of the Higgs boson back in 2012.


The goal of this article was to show the reader the workflow of a CMS investigator researching a certain process, be it a search for new particles like dark matter or an already studied process. From the detector components to data collection, simulation, and analysis, I hope you have acquired a general understanding of these concepts, albeit a superficial one. The literature on this subject is written by CERN physicists, so knowledge of all these steps is regularly assumed, and the average reader can easily get lost in the concepts and jargon employed.

I hope this article has helped you get a better understanding of this workflow, and that it’ll maybe spark some interest in particle physics, helping you understand further research done and news stories about it.

Apr 27 — 2021

Quantum Physics I — the CERN data workflow in new Physics discoveries.

HOW DATA ANALYSIS AT CERN CAN HELP DETECT DARK MATTER
A comprehensive guide to the CERN workflow in new Physics discoveries.



Are you interested in learning more about particle physics? You might have heard terms like neutrinos, quarks, or dark matter mentioned before and want to know more about them. However, the literature and articles involved mainly use terms and concepts that the reader is supposed to already know, which makes them inaccessible to anyone not in this field.

In this article, I’ll provide an easy-to-understand explanation about everything you’d need to know to understand the main points of these articles: the main results, how they’ve been obtained, and the methodology behind their collection. It will be divided into two parts:

  • The first one will focus on the basic concepts regarding this subject: a basic Standard Model (SM) overview and a description of dark matter and the Compact Muon Solenoid (CMS) detector.
  • The second one will focus on the details of data collection and analysis at CERN; this one is where the Data Science component will be.

My experience with this subject comes from my undergraduate project in Physics dedicated to a dark matter model verification with LHC Run 2 data, where my love for experimental particle physics was consolidated. I hope this article will help you understand the basics of these studies at CERN and get you started in further publications on this topic.

Standard Model Basics

We'll start with a short description of the Standard Model (SM), the theory that describes the structure of matter and its interactions. It's worth noting that the term model dates back to the 1970s, when the theory didn't yet have enough experimental evidence to support it; nowadays it does.

The SM postulates the existence of two kinds of particles: fermions, which compose all visible matter, and bosons, which mediate the fundamental interactions (electromagnetism, the strong and weak forces, and gravity; integrating this last one is still one of the biggest mysteries of modern physics). According to this theory, every interaction between two particles is mediated by the exchange of a boson. Below you can see a simple diagram of these particles and their classifications. The main difference between the two types is their spin: fermions have half-integer spin, while bosons have integer spin, and this is the main reason the physics of these particles is so different.

Fermions are further divided into two families of 6 particles each: leptons (the electron, muon, and tau, plus their respective neutrinos, particles with very little mass and no electric charge that appear in the decays of their associated lepton) and quarks. The main difference between the two is that quarks are affected by the strong force while leptons aren't, and the reason comes from the charges these particles carry. Another kind of particle not shown in the previous graphic is the antiparticle, the same particle but with opposite electric charge; you may know it by its much cooler name of antimatter. Furthermore, quarks cannot be found as independent particles and must form bigger particles called hadrons, the most famous being the proton and the neutron, with quark configurations uud and udd respectively; this phenomenon of quark hadronisation is called confinement and is a really interesting topic for those who want to dig deeper. Since the fundamental interactions are mediated by bosons, these must couple to the corresponding fermions, which is only possible if they share the same type of charge. To explain this further, let's examine each of the forces:
  • ElectroMagnetism: this force is mediated by the photon (γ) between particles that share electric charge; this includes all quarks, all leptons except neutrinos, and both bosons W⁺ and W⁻.
  • Weak force: this is the main force behind the decay of some particles into others, like radioactivity, and is mediated by the bosons W⁺, W⁻, and the boson Z⁰ (the superscript indicates their electric charge). The charge needed for this interaction is the weak charge, and all fermions have it.
  • Strong force: this is the force that binds the atomic nucleus together, and is mediated by the gluon (g), which has no electric or weak charges. However, a big difference with the previous interactions is that the gluon possesses colour (this is the name given to the strong charge). Its name comes from the fact that this interaction is several times stronger than electromagnetism, thus showing how the nucleus can exist even when it’s made out of protons that should be repelling one another. Only quarks have colour, and as such are the only particles affected by the strong force.
The last piece of the puzzle in the SM is the Higgs boson, which couples to every massive particle (including itself!) except neutrinos, and is responsible for giving particles their mass. This may seem like a lot of information at once, but don't worry: the goal of this overview of the SM is just to give some context and familiarise you with the forces and particles that rule our universe; you can always reread it later.

What is Dark Matter?

We’ve talked about dark matter in this article before, but we haven’t really given any sort of description of it; we’ll tackle this briefly in this section, discussing the evidence that supports its existence as well.

Several sources, going back as far as the 1930s, show that astronomical calculations involving galaxy masses and rotation speeds don't match the results expected from their observable mass. Another example is the Cosmic Microwave Background, or CMB, from which we learned that baryonic matter (meaning stars, planets, humans, etc.) only makes up ~5% of the universe's total energy content; here is some documentation on the CMB, gravitational lensing, and Hubble's law that expands on this matter.

Is there a possibility that this is actually some sort of known particle? Some possible candidates could be:
  • Antimatter: this is impossible since the matter-antimatter annihilation process shows very characteristic 𝜸-rays, and these are not seen.
  • Black holes: this can't be the answer either, since black holes bend light around them in a very characteristic way (lensing), and surveys don't see enough of these events to account for the missing mass.
So now we know most of the universe is made of some sort of matter/energy that doesn't interact electromagnetically, which makes it pretty much invisible to most detectors on Earth, but does interact gravitationally; if it didn't, we wouldn't even know it existed. Hence the name dark matter, the dark alluding to the fact that it doesn't interact with light the way baryonic matter does. As you can guess, knowing that around 95% of the universe is made of something we don't understand makes this one of the main research topics of modern physics, including searches at particle accelerators like the LHC.

The CMS detector of the LHC

The LHC is the biggest, most powerful particle collider in the world, and provides the best opportunity to discover new particles like dark matter. In this article we'll focus on CMS (Compact Muon Solenoid), but the data collection and general workflow apply to the other three main detectors: ATLAS, ALICE, and LHCb. CMS owes its name to the muon chambers that, combined with the solenoid, offer the best muon detection resolution available today. Several components in CMS can detect the traces that different particles leave from the collision point; if you want a more detailed explanation, take a look at this article. There are lots of components, as you can see, but the important thing to remember is that their purpose is to analyse the different tracks particles leave: electric charge, trajectories, energy, etc. With all of these concepts introduced, we're ready to dive deeper into the specifics of data collection at CERN in the second part!
Dec 11 — 2020

Investing in Artificial Intelligence



While overall adoption of artificial intelligence (AI) remains low among businesses, every senior executive I discuss the topic with claims that AI isn't hype, although they confirm they feel uncertain about where these disciplines may provide the largest rewards. One premise is obvious to many: to-be-launched AI initiatives must generate business value, revenue and/or cost reductions, at least at the function level, so that CFOs and other executive committee members are on board. A recent McKinsey survey claims that a small cluster of respondents from various vertical industries already attribute 20% (or more) of their organisations' earnings to AI. The topic is undoubtedly worth a thorough evaluation.

Aiming to provide informed recommendations to the reader, I write from experience, while also exploring scientific breakthroughs and validated use cases that can be shared. I hope this article may serve as a starting point for any business leader who needs to take a leap forward to sustain competitiveness, or who aims to enhance the quality and impact of their work.

In this article I attempt to advise the reader on different ways to make the first step into the discipline, commenting on the organisational areas where AI can have the biggest short-term impact. Before you read on, you should not ignore a crucial point: AI's end goal is to serve and empower humans to do more, to be better, smarter and happier. Corporations operating more efficiently and generating added value in the market can only be a consequence of understanding this well. I highly recommend a proper understanding of the terms (Data Science and AI) as a starting point for digesting this article.


For some time now, CDOs and CIOs have been using different machine learning and other AI capabilities without paying much attention to where the return on investment is highest. Our work consistently shows that the organisational units that most commonly reap large rewards from AI tend to be the areas where the most money is already being spent. Sounds logical, but what does it mean? Simply that AI can have the biggest short-term impact where more money is being spent. So, the first piece of advice here is to follow the money. Studies from McKinsey claim that supply chain management (manufacturing) and sales generation (advertising and marketing in B2C strategies) are the two functional units where AI has traditionally proved to have the biggest impact. Both of these business areas require heavy capital expenditure, so it seems reasonable that both quickly reap big profits.

AI can have the biggest short-term impact where more money is being spent.
So, the first piece of advice here is to follow the money.

The second way leaders can make up their mind about where to apply AI is to look at the functions where traditional analytics are already operating usefully and may have room to evolve: AI, through neural networks for instance, may strengthen existing use cases by providing additional insight and accuracy beyond established analytical techniques. If you lack the computational power and/or the knowledge to apply state-of-the-art machine learning or deep learning algorithms, you only need to reach out to data scientists. This may not be worthwhile in every case, e.g. if the additional accuracy is not worth the investment, or when a bullet-proof scientific equation already works well and doesn't require any machine learning. Either way, my second piece of advice: every traditional analytical method currently in use is worth a thorough evaluation, since an upgraded implementation could make a qualitative and quantitative difference.

A third option is looking into Robotic Process Automation (RPA). You can think of RPA as a software robot that mimics human actions and can handle high-volume, repetitive tasks at scale. It is an evolution of the traditional Robotic Desktop Automation (RDA) that has helped tremendously in the past by simplifying, automating and integrating technologies and processes on employees' desktops.


The main difference between RDA and RPA is their scope. RDA is implemented on each user's device, interacting only with that specific user's applications and software; RPA encompasses multiple users, departments and applications. If you look at the previous graph, you will also see that RPA is associated with doing, whereas AI is concerned with the simulation of human intelligence by machines. RPA is suitable for automating grunt, repetitive, rule-based tasks on which humans spend time without being able to improve much over time. For this third option, since RPA is highly process-driven, I recommend conducting an initial process-discovery workshop as a prerequisite: mapping out the existing “as is” workflows in order to identify gaps and inefficiencies.

Many of our clients believe RPA is a smart, safe bet as a first step on the AI stairway, citing reasons such as achieving quick wins and capturing low-hanging fruit. Our time-to-market for RPA projects at Bedrock is usually a matter of weeks, with reasonable costs and challenges, while guaranteeing a measurable return on investment; we have led projects of this kind where human labour went down by almost 89%. This is obviously a significant cost reduction, but it also reduces manual interaction and, with it, the room for human error.

The next step on the stairway would take your business to full intelligent automation, which brings me to my fourth piece of advice. AI can help you automate decision-making, because modern computational power can outperform humans' ability to process data, and it is your duty to assess whether this can help you and the other business leaders you work with. AI applied to power better decision-making has been called Augmented Intelligence: an alternative conceptualisation of AI that focuses on an assistive role as a cognitive technology, designed to enhance human intelligence rather than replace it. This means it could assist not only C-level execs or boards of directors but any managerial layer in effective decision-making. You must plan to build a hybrid, collaborative approach where human intelligence and AI work together.

Gartner has estimated that by 2030, decision “augmentation” will surpass all other types of AI initiatives, accounting for 44% of global AI-derived business value. They also forecast that in 2021 it will generate roughly $3 trillion of business value in the corporate world. It is your choice, then, whether your earnings before interest and taxes (EBIT) fall within this forecast.

You must plan for building a hybrid collaborative approach where human intelligence and AI work together.


My last piece of advice, the fifth and probably the most important, is to care about your people. Embedding AI across an organisation is a big cultural shock. There has been some paranoia about AI taking over jobs, but it should be understood as a matter of adapting our society to advancement. Humans shape our technologies only at the moment of conception; from that point onward, they shape us. We saw this with the smartphone: that little device changed how we communicate with relatives and do business. The same applied to commercial flights and automobiles decades ago. Humans rebuilt our cities and lives around these breakthroughs, and the same will happen with AI.

Many still perceive AI as a job killer, when it should be seen as a powerful hybrid (human and robotic) workforce. Getting buy-in for this new “workforce” might be difficult, because humans fear the unknown. The correct response from leaders is to be open and honest with employees, giving everyone an understanding of what AI is, how it will be used, and what its lasting impact on current workers and their lives will be.

Humans rebuilt our cities and lives based on these breakthroughs and the same will apply to AI.

Moreover, mastering AI requires specialised talent and tech tools, as well as extensive training to ensure proper adoption. The ROI of this last piece of advice is not as tangible and measurable as the previous ones, but it will surely make a difference long-term.


Summing up, I have provided you with objective and useful advice on how to make the first step in the AI journey, sharing five recommendations on how your organisation should make safe moves in the discipline which were:

  1. Paying attention to areas where big money is being spent, effectively leveraging specific domain expertise.
  2. Assessing which current analytical methods could be improved. New insights or higher accuracies may quickly surface with advanced ML methods.
  3. Mapping out repetitive processes that could be candidates for efficient RPA. Humans must do intelligent work; leave repetitive tasks to machines.
  4. Building a transversal Augmented Intelligence capability. Computers can handle more data than your brain. They also handle it objectively and without getting tired. Make them work for and with you.
  5. Remembering it is all about people: culture, transparency and robust, efficient processes are the most solid foundation on which to build an AI-powered business.

For the past few years, many enterprises have wasted millions of euros on digital transformation initiatives that were not aligned with the real requirements of the business, let alone with individuals' needs. AI success in enterprises in 2021 and beyond will only be possible if you are capable of aligning the right technology, educated employees and intelligent processes with the business's long-term vision.

If you are leading a company where you are planning to test the waters and make the first step with AI, follow the previous five recommendations and you will not go wrong.

Do not hesitate and move fast!

Start your own ecosystem of AI partnerships and providers. Outsource if you need to. Why should you be in such a hurry? Well, COVID-19 has exponentially accelerated digitalisation. Companies currently benefiting from AI are planning to invest even more in response to the post-pandemic era, and this will inevitably create a wider divide between AI leaders and the majority of companies still struggling to capitalise on the technology. So the long-term danger for you is not losing jobs to robots, but failing to remain competitive in your market niche.