Fermilab CDF SVX DAQ Debugging
Charis Quay Huei Li
Mount Holyoke College
Supervisor: Mary Bishai, Fermilab
Alternate Supervisor: Jean Slaughter, Yale University
Abstract: This paper concerns the author’s internship at Fermilab in the summer of 1999. Described are several tests of the CDF (Collider Detector at Fermilab) SVX (Silicon Vertex Detector) DAQ (Data Acquisition System) and their results. Some details of, and efforts to streamline, the running of the system are also included.
August 5, 1999.
At least for now, Fermilab, located in the middle of nowhere (Batavia, IL, to be exact), is the highest energy particle accelerator in the world. The basic idea of the collider experiments (as opposed to fixed-target experiments, in which protons are aimed at a fixed target, usually copper) at Fermilab is to take protons and anti-protons (p̄, pronounced pee-bar), accelerate them in a ring called the TeVatron till they are moving close to the speed of light, then smash them together. The particle beams are not continuous, but are really ‘bunches’ of particles. The beams are accelerated by electric fields and bent by magnetic fields. By studying the new particles produced in these collisions, we hope to understand more about the nature of nature.
This paper is about what I did in the summer of 1999 at Fermilab. My main job was to help run the CDF (Collider Detector at Fermilab) SVX (Silicon Vertex Detector) DAQ (Data Acquisition System), which was being debugged. I also wrote some software and documentation to help make the lives of future DAQ users less miserable. The long story follows.
Figure 2: CDF Cross Section
The Collider Detector at Fermilab is one of two detectors at Fermilab, the other being DZero. CDF is also known as BZero. This is because there are 6 places around the ring where collisions can occur, AZero ... FZero. Figures 1 and 2 show what CDF looks like.
IV. SVX (Silicon Vertex Detector)
The Silicon Vertex Detector is the innermost layer of the CDF. It is so named because it is made primarily of silicon, and because its main job has to do with vertices. The primary and secondary vertices of a particle are where it is produced and where it decays respectively. When dealing with particles with short lifetimes – such as the infamous top quark – it is crucial, and yet difficult, to distinguish these vertices because they are so close together. The SVX is able to do this well because it has excellent spatial resolution, better than about 10 microns.
The first SVX was added to the CDF in 1992. It was then replaced with a radiation-hard (not easily damaged by radiation) SVX’. In Run II of the accelerator, scheduled to start in early 2000, the silicon vertex detector is called the SVXII. The main features that distinguish it from the earlier SVX’s are:
Physically, the SVX consists of 3 ‘barrels’, each with 5 concentric layers, Layer 0 to Layer 4. The building blocks of the SVX are silicon microstrip detectors. There are rows of these detectors lengthwise and crosswise on pieces of silicon called ‘ladders’. The width of these ladders varies depending on the layer. The ladders are then arranged in cylinders, 12 ladders per layer per barrel. A ‘wedge’ is a 30° slice of a barrel.
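The counts implied by this layout are easy to tally. The sketch below is in Python, which is not what was used on the project, and it uses only the numbers quoted above. The constant names are mine.

```python
# Geometry numbers quoted in the text; the names are illustrative only.
BARRELS = 3
LAYERS_PER_BARREL = 5
LADDERS_PER_LAYER = 12      # per layer, per barrel
WEDGE_ANGLE_DEG = 30

# 12 wedges go around each barrel
wedges_per_barrel = 360 // WEDGE_ANGLE_DEG
# each wedge therefore holds one ladder from each of the 5 layers
ladders_per_wedge = LAYERS_PER_BARREL * LADDERS_PER_LAYER // wedges_per_barrel
# 3 x 5 x 12 ladders in the whole detector
total_ladders = BARRELS * LAYERS_PER_BARREL * LADDERS_PER_LAYER

print(wedges_per_barrel, ladders_per_wedge, total_ladders)
```

So a wedge is really a stack of five ladders, one per layer, which is worth keeping in mind when the readout channels are counted later.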
When a charged particle passes through the SVX, it induces electrical signals on strips in its path. The strips are then read out by silicon chips, which sit on hybrids at either end of the ladders. Hybrids are circuits that connect multiple chips to the silicon. A hundred and twenty-eight strips, or channels, are read out by each chip. What happens after this is covered in the next section about the DAQ (Data Acquisition System).
V. DAQ (Data Acquisition System)
The DAQ handles the data that comes back from the chip. Please refer to the simplified schematic on the next page.
The Trigger Supervisor, common to all of CDF, sends triggers to all the different parts of CDF when it sees an event that it thinks might be interesting. These triggers are received in the SVX by the Silicon Readout Controller (SRC), the ‘brain’ of the DAQ. It then sends commands to the VFO’s (VRB Fan Outs) and FFO’s (FIB FanOuts) which are fanned out to the VRB’s (VME Readout Buffer) and FIB’s (Fibre Interface Boards). There are 10 FIB’s/VRB’s for each of their FanOuts.
The FIB’s pass the commands on to the chips via junction boxes and port cards. There are 2 junction box-port card pairs for every FIB. Each Port Card is responsible for one end of a wedge, reading out the 5 layers in 5 channels. If you are paying attention, it should make sense that this makes for 10 channels per FIB.
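For those not paying attention, the channel arithmetic can be written out. The Python sketch below uses the numbers given in this paper (3 barrels of twelve 30° wedges, one port card per wedge end, 2 port cards per FIB, 10 FIB’s per FanOut). The totals for the full detector are my own extrapolation, assuming every wedge end is instrumented and that 10 is the maximum per FanOut; they are not figures quoted by the DAQ group.

```python
import math

WEDGES = 3 * 12               # 3 barrels x 12 wedges of 30 degrees each
ENDS_PER_WEDGE = 2            # one port card per wedge end
PORT_CARDS_PER_FIB = 2
CHANNELS_PER_PORT_CARD = 5    # one channel per layer
FIBS_PER_FANOUT = 10          # assumption: treated as an upper limit

channels_per_fib = PORT_CARDS_PER_FIB * CHANNELS_PER_PORT_CARD   # stated in the text

# Extrapolated totals (assumption: the whole detector is instrumented this way)
port_cards = WEDGES * ENDS_PER_WEDGE
fibs = port_cards // PORT_CARDS_PER_FIB
fanouts = math.ceil(fibs / FIBS_PER_FANOUT)

print(channels_per_fib, port_cards, fibs, fanouts)
```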
Data coming back from the chips goes through the FIB’s and is passed on to the VRB’s via 4 fibre optic links (G-links) per FIB-VRB pair.
All these things – FIB’s, VRB’s and the SRC – are boards that sit in slots in crates in the detector area. Each crate has a VME processor so that the Windows NT computers that we use to run the system can ‘talk’ to it via the Ethernet. In Run II, there will only be one SRC, but right now, there are several; one for each test stand and a few extra for a total of nine. They are named after Snow White, Prince Charming (often called ‘the Prince formerly known as Charming’ when he behaves badly) and the Seven Dwarves.
b. The SVX 3D Chip
To understand the next section about how we debug the DAQ, you first need to know a little about the workings of the chip. The current version is 3D. (Does not stand for 3-dimensional.) When an event occurs, the aforementioned ‘electrical signal’ is first amplified and stored as some quantity of charge on a capacitor known as the integrator, which sits at the front door of the chip. The charge is then transferred to one capacitor, or ‘cell’ in a ‘pipeline’ of 47. One of the cells is used as a reference for calculating event to event and channel to channel ‘noise’, or variations.
If the Trigger Supervisor thinks that an event might be of interest, a Level One Accept (L1A) is sent to tag the corresponding cell. A pointer is set so that that cell is bypassed in subsequent ‘writings’ (transfer of charge) to the pipeline until it has been digitised and read out. It is effectively taken out of the pipeline. Up to four cells can be tagged at once.
When a PRD1 (Pipeline Readout One) signal is sent, the analog data (i.e. charge) on the capacitor is passed to the ADC (Analog to Digital Converter) for digitisation and readout. After digitisation, the cell is returned to the pipeline upon receipt of the PRD2 (Pipeline Readout Two) signal. PRD2 also resets the reference cell.
Another important signal is PARST (pronounced Pre-Amp Reset), which resets the front-end integrator. This has to be done because the integrator accumulates charge at each event, transferring the difference to the pipeline. If not reset occasionally, it will become saturated.
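The bookkeeping described above (write, tag, digitise, release) can be mimicked with a toy model. This Python sketch is only an illustration of the description in this section, not the chip’s actual logic. The class and method names are invented, and putting the reference cell at position 46 is an assumption, since which cell serves as the reference is not stated here.

```python
class ToyPipeline:
    """Toy model of the SVX3D pipeline as described in the text (not the real logic)."""

    N_CELLS = 47       # capacitors in the pipeline
    REFERENCE = 46     # assumption: position of the reference cell
    MAX_TAGGED = 4     # up to four cells can be tagged at once

    def __init__(self):
        self.tagged = []      # Cell IDs awaiting readout, oldest first
        self.history = []     # Cell IDs in the order they were written
        self.pointer = -1     # last cell written

    def write(self):
        """Move to the next cell that is neither tagged nor the reference."""
        while True:
            self.pointer = (self.pointer + 1) % self.N_CELLS
            if self.pointer != self.REFERENCE and self.pointer not in self.tagged:
                break
        self.history.append(self.pointer)
        return self.pointer

    def l1a(self, depth):
        """Level One Accept: tag the cell written `depth` writes ago."""
        if len(self.tagged) >= self.MAX_TAGGED:
            raise RuntimeError("four cells are already tagged")
        cell = self.history[-depth]
        self.tagged.append(cell)
        return cell

    def prd1(self):
        """PRD1: the oldest tagged cell is the one passed to the ADC."""
        return self.tagged[0]

    def prd2(self):
        """PRD2: return the digitised cell to the pipeline."""
        return self.tagged.pop(0)


pipe = ToyPipeline()
for _ in range(10):
    pipe.write()                  # cells 0 to 9 are written in turn
tagged_cell = pipe.l1a(depth=4)   # tags the cell written four writes ago
writes = [pipe.write() for _ in range(43)]
```

In this run the tagged cell is skipped when the write pointer comes round again, and the reference cell is never written, which is the behaviour the text describes.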
Figure 6: Simplified Schematic of the SVX3D Chip.
B. DEBUGGING DAQ3
Now I have to confess that we don’t work with as grand a system as described in the previous section. There are several groups of people working on debugging the DAQ. Each of them works on a test stand (DAQ1 – DAQ4) consisting of close to the minimum number of boards (usually 1 or 2 of each) necessary to simulate the DAQ.
My supervisor and I are on DAQ3. For the most part, we run single chips on one port card. Sometimes, we run 2 chips on separate port cards simultaneously. Since we are not yet connected to the rest of the detector, the SRC emulates the Trigger Supervisor, sending L1A’s and L2A’s randomly. We don’t really read out ‘events’, but rather, background charge on the chip.
We employ a very simple procedure for debugging the SVX DAQ – run the system and look for bugs. When the system refuses to initialise altogether, often the culprits are trivial things like broken bonds, blown regulators or power supplies that aren’t turned on. Sometimes it turns out to be something bigger, but most of our bugs are discovered when the system is up and running.
One of the first places we usually see bugs is the histograms output by the Java program that runs the system, but we also keep our eyes open for other signs, e.g. unusual current draw from power supplies.
We then try to figure out what is causing the bug by probing signals coming from different parts of the DAQ. We look at them on the logic analyser and oscilloscope to see if they look normal and to check to see if signals are lost anywhere along the line. Then we make ‘fixes’ that we think might solve the problem. It’s like putting together pieces of a puzzle.
When we have worked out all the bugs we can find, we change some variables and run in different modes to see if the chip is affected. For example, although in theory the chip has a pipeline of 42 cells (47 – 1 reference – 4 possibly tagged), we had the Pipe Depth – the effective pipeline depth – set to 4 for a while. In one of the tests, I ran the system at varying pipe depths. Effects were observed that suggested problems in the SRC pipecap logic, which tells the SRC which cell is tagged. These were later corrected by Petar (Maksimovic) and the other SRC people.
When we think we understand what makes the chip tick at one level, we add more complexity. For example, at the moment, among other things, we are moving from running with single chips to running with multi-chip objects.
Figure 7: The DAQ Control Window
This is the main window of the DAQ–running software. We click on things in here to make the system run.
C. THE HUNTING OF A BUG
The following section describes in greater detail one particular bug – the Pre-Amp Reset (PARST) and Cell ID Problem – and how we solved it.
Figure 8: Nice Cell ID’s Figure 9: Cell ID 0. Not good.
Figure 8 shows one of the many histograms that the Java program that runs the system can output.
The x-axis is the Cell ID’s, the ‘name tags’ of the capacitors in the chip’s pipeline. The y-axis is the number of times events have been read out from that particular cell. Since the SRC is sending L1A’s (tagging cells) randomly, we would expect a more or less uniform distribution of Cell ID’s. The histogram on the left (Figure 8) would be considered more or less ‘normal’ or ‘good’ (which does not by any means imply that it was typical).
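What ‘more or less uniform’ versus ‘stuck at zero’ means can be sketched with simulated data. The Python below is an illustration only: the Cell ID’s are random numbers rather than real readout, the count of 46 taggable cells (47 minus the reference) and the 50% threshold are my assumptions, and the function names are invented.

```python
import random
from collections import Counter

N_CELLS = 46   # assumption: 47 cells minus the reference cell can be tagged


def cell_id_histogram(cell_ids):
    """Count how many times each Cell ID was read out."""
    return Counter(cell_ids)


def looks_stuck_at_zero(hist, threshold=0.5):
    """Flag a histogram in which Cell ID 0 holds more than `threshold` of all entries."""
    total = sum(hist.values())
    return total > 0 and hist[0] / total > threshold


rng = random.Random(1999)
# Random L1A's: every cell should be tagged about equally often.
good = cell_id_histogram(rng.randrange(N_CELLS) for _ in range(10_000))
# A made-up 'Cell ID 0' sample: most readouts report cell 0.
bad = cell_id_histogram([0] * 9_000 + [rng.randrange(N_CELLS) for _ in range(1_000)])

print(looks_stuck_at_zero(good), looks_stuck_at_zero(bad))
```

In practice we judged the histograms by eye, but a check of this kind is what the eye is doing.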
Now a little aside about port cards. These things come in 2 breeds, discrete and compact - also known as DPC’s and CPC’s. The major difference is that the circuit elements on the DPC are discrete, i.e. you can see the little capacitors and what not sitting on the thing, and those on the CPC are in integrated circuits, so that all you see is a box. In addition, DPC’s can only read out one ‘SVX device’ (i.e. chip or multi-chip object), while CPC’s, as advertised earlier in this report, can potentially read out 5. The DPC is just for testing and is not going to be used in Run II.
A further tangent on the use of the acronym HDI: It stands for High Density Interface, and I am told originally meant the cables that run from port card to chip, hybrid and ladder. However, it is now used variously for the cables, the channels on the port card that the cables plug into and even the SVX devices (the stuff on the port card).
Not long after CPC’s started being tested, it was observed that chips hooked up to them sometimes exhibited the behaviour shown in Figure 9, the infamous ‘Cell ID 0 problem’.
The only signals entering into the logic that determines Cell ID are PARST, L1A, PRD1, PRD2 and FECLK (Front-End Clock). We (or rather Mary and Jean) did not see how L1A and the PRD’s could cause this problem, so we started off by looking at PARST.
We modified the PARST signal by putting a resistor between it and ground. This serves to lower the level of PARST. We found that in the range of about 200 Ω to 400 Ω the Cell ID’s were uniform. Below 200 Ω, they start to show ‘stuck-at-zero’ behaviour and above 400 Ω, they become ‘pitchforked’.
Then we varied the level of DVDD – the voltage supplied to the digital end of the chip – trying various combinations of resistor and DVDD values. (The normal value of DVDD is 5V.) It was observed that the chip becomes unhappy when the level of PARST is not far enough below DVDD. A difference of about 1V seemed to iron out most problems.
Next, we tried putting capacitors between PARST and ground. This lengthens the rise and fall time of the signal, i.e. the time it takes for the signal to go from 10% to 90% of the maximum value and vice versa, respectively. Capacitors of > 300pF, corresponding to rise and fall times of > about 40ns, also seemed to do the trick.
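The link between capacitance and rise time can be made explicit if one is willing to treat the PARST line as a simple single-pole RC circuit, which is an assumption on my part and not something we measured. For such a circuit the 10% to 90% rise time is RC × ln 9, or about 2.2 RC. The Python sketch below turns the two numbers quoted above into the effective resistance they would imply under that model.

```python
import math


def rc_from_rise_time(t_rise_s):
    """Time constant RC of a single-pole RC circuit with the given 10%-90% rise time.

    From V(t) = V0 * (1 - exp(-t / RC)): t90 - t10 = RC * ln(0.9 / 0.1) = RC * ln 9.
    """
    return t_rise_s / math.log(9)


C = 300e-12       # 300 pF, from the text
T_RISE = 40e-9    # about 40 ns, from the text

# Effective resistance implied by the model: roughly 60 ohms.
r_effective = rc_from_rise_time(T_RISE) / C
print(f"{r_effective:.0f} ohm")
```

This is a consistency check on the quoted numbers and nothing more; the real line is certainly not an ideal RC circuit.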
We swopped chips several times to make sure that this problem was not chip-specific. It wasn’t.
At this point, there was some debate about whether we should try to understand why the resistors and capacitors seem to solve the problem, or whether we should just make the fix and move on.
On June 22, Amy Conolly at LBL (Lawrence Berkeley National Lab) observed that the Cell ID problem goes away if PARST falls when FECLK (Front–end clock) is high. Later, the chip experts informed us that this was an unexpected design flaw in the chip.
Figure 10: PARST Falling in FECLK. And it was Good.
Mary moved PARST fall forward until it fell during FECLK high. This is done by messing with things called ‘FIB sequences’. When the SRC sends a command, say digitise, to the FIB to pass on to the chip, the FIB translates this single command, digitise, into a ‘FIB sequence’ of more detailed commands that tells the chip exactly how to digitise. However, the Cell ID problem was still observed.
Tom Zimmerman (chip expert) suggested that PARST must still be falling during FECLK low. Although we did not quite see how this could be, Ken (Treptow, FIB expert) got the oscilloscope to trigger on PARST fall when FECLK is low, and sure enough, we found that it was happening fairly frequently. We also observed that PARST width wasn’t constant. Sometimes it was high for a long time, and sometimes only for a short time.
At Jean’s suggestion, I looked at what the other signals were doing when PARST fell during FECLK low. I noticed that PRD1’s always appeared soon after PARST fall. Ken observed that PRD1 is part of the digitise cycle (i.e. the group of commands that tells the chip to digitise) and suggested that we look in that part of the FIB sequence. We did, and found a ‘PARST stop’ command at the beginning that everyone had forgotten about. All that had to be done was to move it to within a FECLK.
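The check Ken set up on the oscilloscope, triggering on PARST fall while FECLK is low, amounts to the following when done on sampled traces. This Python sketch assumes the two signals have been captured as lists of 0/1 samples taken at the same instants (a hypothetical logic analyser export); the function name and the sample data are invented for illustration.

```python
def parst_falls_by_feclk(parst, feclk):
    """Count PARST falling edges, split by the FECLK level at the sample where PARST goes low.

    `parst` and `feclk` are equal-length sequences of 0/1 samples.
    """
    if len(parst) != len(feclk):
        raise ValueError("traces must have the same length")
    counts = {"high": 0, "low": 0}
    for i in range(1, len(parst)):
        if parst[i - 1] == 1 and parst[i] == 0:          # falling edge of PARST
            counts["high" if feclk[i] else "low"] += 1
    return counts


# Made-up traces: FECLK toggles every two samples, PARST falls twice.
feclk = [1, 1, 0, 0] * 4
parst = [0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
result = parst_falls_by_feclk(parst, feclk)
print(result)
```

Any non-zero count under "low" is the bad case: a PARST fall that the chip does not tolerate.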
Afterwards, it was discovered that a few other seemingly unrelated problems were caused by the same thing. So we killed multiple birds with one stone, and that’s always good.
This was one of our bigger bugs. The cause now seems trivial, but it took more than a few people a good amount of time to solve. However, while hunting down this bug, quite a few improvements were made to the system. For example, the FIB sequence resolution (amount by which you can move signals around) was halved. Software was developed by Mary to look at ‘what the SRC thinks the Cell ID is vs. what the chip thinks it is’ more easily. This software later came in very useful in other tests. We also discovered minor bugs, like the fact that PRD2’s during digitise cause Cell ID’s to become ‘stuck-on-63’.
Figure 11: The PRD1 Clue
There was a need for a document describing the basics of running the SVX DAQ system. Such a document did not then exist and it will become crucial that it should exist as more people start using the system in the future. Therefore, in my copious free time, I wrote the DAQ Cookbook and converted it into the Queen’s HTML (i.e. not editor-generated) and Vince Pavlicek put it on the web at http://www-ese.fnal.gov/eseproj/svx/DAQ_user/DAQ.htm. The original Word 97 version is also available on the web, for ‘those who can’, at http://www-ese.fnal.gov/eseproj/svx/DAQ_user/DAQ.doc.
I also wrote several perl scripts as the need arose. A brief description of the scripts follows:
cfgcrate: My masterpiece this summer. It allows the user to add, remove and modify modules (boards) and ‘SVX devices’ in crate configuration files. It has also been properly trained to squeal if you try to do illegal things such as putting more than one module in a slot or inventing non-existent slot/HDI numbers. In the Dark Ages before The Script, users had to manually edit the crate configuration files. This is not too bad with a small system, but as you add more modules to your crates it can get tedious. And as the user becomes more and more bored, the likelihood of error also increases. Computers, being what they are, will stop working at the drop of a bracket, so this script made it less likely any such thing should happen.
As an example of before and after, suppose you wanted to add a FIB, and the SVX devices (chip and multi-chip objects) attached to it, to slot 21 in the crate daesw1.
In the first place, the user would have to know how to use a UNIX text editor like pico, vi or emacs, and know the format of the cfg file s/he wants to modify.
Harv5.fnal.gov 22% emacs daesw1.crate &
↓ ↓ ↓ ↓ ↓ ↓ ... ... ... move cursor
Slot: space 21 tab Base Addr: ...
‘What’s 21 divided by 2 multiplied by 10000000 in hex?’
‘I dunno. Look it up’.
Pull up number converter. (Grumble.)
... space 0xa8000000 tab Cfg file: space std.fib
etc. etc. ...
And at this point, the user has only added the FIB.
Figure 12: Before the Script
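The hex arithmetic that interrupts the manual edit above is exactly the kind of thing a script should do for the user. The Python sketch below applies the rule quoted in the dialogue (slot number divided by 2, multiplied by 10000000 in hex). The rule is inferred from that one example, so whether it holds for every slot and module type is an assumption, and the function name is mine.

```python
def slot_base_address(slot):
    """Base address for a crate slot: slot / 2 * 0x10000000, as in the dialogue above."""
    # Multiply before dividing so that odd slot numbers stay exact integers.
    return slot * 0x10000000 // 2


print(hex(slot_base_address(21)))   # 0xa8000000, the value typed into the file above
```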
Harv5.fnal.gov 22% cfgcrate
Please enter crate name.
Please enter slot number.
What would you like to do? <a/d/m>a
What would you like to add? <fib/ffo/src/svx/vrb>fib
FIB added to crate daesw1 slot 21
There are no SVX devices set up on this FIB. You can add some now.
Please enter HDI's separated by spaces.
0 3 9982
How many chips on HDI 0?
1 chips added to HDI 0 on FIB in crate daesw1 slot 21
How many chips on HDI 3?
Invalid number of chips.
Please re-enter number of chips for HDI 3.
1 chips added to HDI 3 on FIB in crate daesw1 slot 21
9982 is not a valid HDI number.
I am just going to skip it.
Would you like to do anything else to this crate? <y/n>
Have a nice life!
Figure 13: After the Script
As you can see, the script requires a lot less thinking and understanding of the innards of configuration files on the part of the user. Also, after manually modifying a configuration file, there is no guarantee that you have not forgotten a space somewhere or spelled a long word - like ‘address’ - wrongly. (And even if you had spelled it right, it turns out that in the land of configuration files, it’s spelled ‘Addr’.)
This script grew increasingly complex as the summer went on, as bugs and loose ends were discovered, as more features and error messages were added and as I learned more perl. It ended up as a 400-plus-line monster.
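The ‘squealing’ itself is simple to sketch. The Python below is not the perl script, only an illustration of the two checks mentioned: refusing a second module in an occupied slot, and sorting HDI numbers into valid and invalid. The valid range of 0 to 9 is an assumption based on the 10 channels per FIB, and all names are invented.

```python
MAX_HDI = 9   # assumption: 10 channels per FIB, numbered 0 to 9


def split_hdis(answer):
    """Split a space-separated reply into valid HDI numbers and rejected tokens."""
    valid, rejected = [], []
    for token in answer.split():
        if token.isdigit() and int(token) <= MAX_HDI:
            valid.append(int(token))
        else:
            rejected.append(token)
    return valid, rejected


def add_module(crate, slot, module):
    """Add a module to a crate (a dict of slot -> module name), refusing occupied slots."""
    if slot in crate:
        raise ValueError(f"slot {slot} already holds a {crate[slot]}")
    crate[slot] = module


crate = {}
add_module(crate, 21, "fib")
print(split_hdis("0 3 9982"))   # ([0, 3], ['9982'])
```

Keeping the rejected tokens, rather than silently dropping them, is what lets the script tell the user that 9982 is being skipped.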
cfgsys: This is a very simple script that asks the user for the names of all the crates in his system and puts the crates in the file system.cfg which is read by the Java program that runs the system.
multichip: Based on the single chip configuration file 1chip.svx, this script allows changes to propagate rapidly through all the multi-chip configuration files and will also generate those files if they do not already exist.
This is very useful since there is a different configuration file for every multi-chip object. Without the script, to have all the necessary configuration files to run/test the SVX DAQ, you would have to write a block of code for every single chip on each of the 2, 4, 5, 6, 7, 10 and 14 chip hybrids/ladders. Even if you cut, pasted and then edited where necessary, it would take an awfully long time.
As for changing variables, e.g. if you want to change the pipe depth, instead of changing it for every single chip in every file, change the number next to Pipe Depth in the script, save it, and then run.
The downside of this script is that users have to mess with the actual script, and be very careful not to delete important things like brackets and quotation marks. I thought about getting the script to read from the single chip file and outputting to the other files, but concluded that it would mean too much effort for a very little extra convenience. In retrospect, it is probably better the way it is, because in setting up a new test stand, only one file (multichip) needs to be moved, not two (1chip.svx too).
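The idea behind multichip, one set of values stamped out once per chip for every multi-chip object, looks roughly like this in Python. The real script is in perl and writes real configuration files; here the block layout and the file names (modelled on 1chip.svx) are invented for illustration, and nothing is written to disk.

```python
CHIP_COUNTS = [2, 4, 5, 6, 7, 10, 14]    # hybrid/ladder sizes listed in the text
SETTINGS = {"Pipe Depth": 4}             # change a value here, then regenerate everything


def render(n_chips, settings):
    """Build the text of one multi-chip file: one block per chip.

    The layout is made up for this sketch; it is not the real .svx format.
    """
    lines = []
    for chip in range(n_chips):
        lines.append(f"Chip {chip}")
        lines.extend(f"  {key}: {value}" for key, value in settings.items())
    return "\n".join(lines) + "\n"


def generate_all(settings=SETTINGS):
    """Return {hypothetical file name: contents} for every multi-chip object."""
    return {f"{n}chip.svx": render(n, settings) for n in CHIP_COUNTS}


files = generate_all()
print(sorted(files))
```

Changing the pipe depth then means editing one number and regenerating, instead of editing a block for every chip in every file.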
file-list: The motivation for this script came from Jean’s frustration one morning after having to open tons of data files to see if they were the ones she wanted. This script pokes through data files for you and outputs the vital statistics – number of chips, last date modified, name of the detector and comments – of files over 100KB. The reason for the 100KB restriction: Sometimes runs are aborted prematurely and aren’t worth analysing. These runs are typically well under 100KB.
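The size cut is the easy part of file-list and can be sketched in a few lines. The Python below only covers what the file system knows (name, size, last date modified); pulling the number of chips, the detector name and the comments out of a data file depends on the file format, which is not described here, so that part is left out. The function name is mine.

```python
import os
import time

MIN_SIZE = 100 * 1024   # the 100KB cut described above


def list_big_files(directory, min_size=MIN_SIZE):
    """Return (name, size in bytes, last-modified date) for regular files over `min_size`."""
    rows = []
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        if entry.is_file():
            info = entry.stat()
            if info.st_size > min_size:
                modified = time.strftime("%Y-%m-%d", time.localtime(info.st_mtime))
                rows.append((entry.name, info.st_size, modified))
    return rows
```

Calling list_big_files on a data directory gives one row per run worth opening, with the aborted runs already filtered out.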
There will be no conclusion to the debugging of SVX DAQ, not until Run II of the accelerator next year, and perhaps not even then. Therefore, I offer you some philosophical thoughts in lieu of a conclusion.
This summer I was exposed to many new things – Big Science, big accelerators, the real world, physics collaboration and Windows NT. I’ve also learnt a lot – Perl, how to use logic analysers and oscilloscopes, some new and very cool UNIX commands, and more than I ever wanted to know about CDF’s SVX. It’s been a great experience.
Many thanks are due to the SVX DAQ group in general and my supervisor and alternate supervisor, Mary Bishai and Jean Slaughter in particular, for their longsuffering and guidance throughout the summer. I must not forget Mary’s lifts home on rainy days, the loan of Jean’s bike and many other small things that helped make my time at Fermilab happy, productive and educational.
The moral support of the physics department at Mount Holyoke and my family is much appreciated.
I am also grateful to Elliot McCrory, Dianne Engram, Jim Davenport and the rest of the SIST committee for making this opportunity possible for the other interns and me, and for their administrative and organisational efforts.
Last but not least, thanks to God, from whom all blessings flow.
REFERENCES AND BIBLIOGRAPHY
Bishai, M. et al. An SVX3D Chip User’s Companion. Fermilab, 1999.
CDF II Collaboration. The CDF II Detector Technical Design Report. Fermilab, 1996. (96/390-E)
Fermilab. (July 6, 1999). Fermi National Accelerator Laboratory. [Online] Available WWW: http://www.fnal.gov Directory/File: Any and all. [Summer 1999]
Free-Ed, Ltd. (1999). Introduction to Perl and CGI Programming. [Online] Available WWW: http://www.free-ed.net Directory: fr03/lfc/course%20030207_01 Files: All in directory. [Summer 1999]
Litke, A. M. and Schwarz, A. S. The Silicon Microstrip Detector. Scientific American, May 1995, pp. 76-81.