Atacama Large Millimeter Array
Demultiplexer design based on a custom GaAs chip

G. Collodi$^{1,2}$, G. Comoretto$^2$


1) Mecsa, Dipartimento di Elettronica e Telecomunicazioni, v. Lombroso 7, Firenze
2) INAF, Osservatorio Astrofisico di Arcetri, largo E. Fermi 5, Firenze

Firenze, June 2002
Summary

In the ALMA sampler, the input data, sampled at 4 GS/s, is transmitted to the fiber-optic formatter as a parallel demultiplexed data stream of 16 samples at 250 MS/s. Several approaches have been proposed for the demultiplexer, since no commercial units are available at this frequency. We propose a scheme based on a custom digital GaAs chip. The chip architecture, the overall demultiplexer architecture and simulation results are presented.

Introduction

In the ALMA antenna backend electronics, the receiver signals are converted to eight IF channels, each covering 2 GHz of bandwidth in the 2-4 GHz range. Each IF signal is then sampled at 4 GS/s with 3-bit resolution and sent to the correlator over a fiber link. The 8 IF channels are organized as 4 IF pairs, usually representing the two polarizations of the same RF band.

Each pair is processed by a sampler. The sampler subsystem accepts two signals in the 2-4 GHz band and delivers its output on three optical fibers at 10 Gb/s. The sampled signal is time-demultiplexed to 250 MS/s before being formatted for fiber transmission.

At the moment there is no agreed design for the demultiplexer. Several schemes have been proposed, based on commercial SONET deserializers [12], delay lines, or commercial discrete chips [9]. In every case the demultiplexer must be synchronous with the sampler clock [13]. In particular, commercial deserializers for optical data generate the demultiplexer clock internally or recover it from the data, so it cannot easily be synchronized to an external reference. In this report we propose a scheme based on a custom GaAs demultiplexer chip, designed to overcome this problem.

This approach offers several advantages in terms of power consumption and system integration, and it eases the synchronization of the demultiplexer clock.

Global design

The sampler subsystem is composed of two sampler units and a fiber formatter. Each sampler unit accepts a signal at approximately 0 dBm and must deliver to the formatter 48 bits of data at 250 MHz, synchronous with the 125 MHz global clock. The sampler accepts two clocks. The first is a sampling clock, nominally a 0 dBm sine wave at 4 GHz with variable phase. The second is a demultiplexer clock obtained by dividing the 4 GHz clock. The demux clock runs at 250 MHz (1:16 demultiplexing), and its phase may vary over a limited range (nominally ±130 ps), following the 4 GHz clock. Both clocks must be generated in a single unit for each antenna and are common to all samplers.
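A few lines of arithmetic cross-check these data rates. The sketch below is a minimal Python check. It assumes two sampler units per subsystem and an even split of the payload over the three fibers. The constant names are illustrative only.

```python
# Data-rate bookkeeping for one sampler subsystem (illustrative constants).
SAMPLE_RATE_HZ = 4_000_000_000   # 4 GS/s per IF channel
BITS_PER_SAMPLE = 3              # 3-bit digitizer
DEMUX_FACTOR = 16                # 1:16 time demultiplexing
UNITS_PER_SUBSYSTEM = 2          # two sampler units (one IF pair)
FIBERS = 3                       # output fibers per subsystem

# Word rate at the demux outputs: 4 GHz / 16 = 250 MHz (twice the 125 MHz global clock).
word_rate_hz = SAMPLE_RATE_HZ // DEMUX_FACTOR
# Parallel bits delivered by one unit per word: 3 bits x 16 samples = 48 bits.
bits_per_word = BITS_PER_SAMPLE * DEMUX_FACTOR
# Aggregate payload of the subsystem, before any framing overhead.
payload_bps = UNITS_PER_SUBSYSTEM * bits_per_word * word_rate_hz
# Payload per fiber, assuming an even split across the fibers.
payload_per_fiber_bps = payload_bps / FIBERS

print(f"word rate      : {word_rate_hz / 1e6:.0f} MHz")
print(f"bits per word  : {bits_per_word}")
print(f"total payload  : {payload_bps / 1e9:.0f} Gb/s")
print(f"per-fiber load : {payload_per_fiber_bps / 1e9:.0f} Gb/s")
```

Under these assumptions each fiber carries 8 Gb/s of sample data. This leaves headroom below the 10 G fiber rate for the formatter's framing.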

Each sampler unit is composed of several subunits that process the input signal in sequence. The overall schematic of a single unit is shown in Fig. 1. The signal is sampled by an optional track/hold unit [5], digitized to three bits [4, 6, 7], and then parallelized to 16 signals at 250 MHz. Auxiliary circuits must:

- distribute clock and power supplies to all units
- control the threshold levels in the digitizer
- monitor the statistics of the sampled signal
- provide simple test signals

A single high-speed FPGA (e.g. the Xilinx Virtex-II, XC2V family) can integrate all the control functions and resynchronize the demux outputs to the system clock.

In our proposal the optional T/H, the sampler chip and the demux are hosted in a multichip module. The module also contains supply filters, the clock power splitter and termination resistor chips. The multichip substrate must have good electrical properties up to 12 GHz and provide a good thermal path for the high-speed chips. Each chip can be provided either as a packaged component or as a bare die for bonding. The former is preferable for ease of integration (especially for mass production), the latter for thermal and electrical reasons.

The module can be mounted on a conventional PCB, which hosts the supply regulators, the threshold control circuits and the FPGA for control and monitoring. The subunits composing the sampler are closely coupled, so their interactions must be carefully considered in the system design. A proposal for the complete sampler system design will be the subject of a forthcoming report.
Deserializer chip

Each bit from the sampler is deserialized independently by its own chip. The relative synchronization of the three bits is guaranteed by the common clock signals and by the small size of the multichip module. The block schematic of the chip is shown in Fig. 2. It is composed of two sets of 16 identical master-slave D flip-flops, organized as a 4 GHz shift register and a 250 MHz parallel latch. The circuit operates internally with custom logic levels, and appropriate buffers convert the input and output signals. This leaves full freedom in the choice of external logic levels. In particular, the sampler output levels (LVDS or LVPECL) are still to be decided.
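A minimal behavioral model makes the shift-register/latch organization concrete. It works at the clock-cycle level and is not a circuit model. The class and method names are hypothetical, and the bit ordering on the outputs is an arbitrary choice. A simple modulo-N counter stands in for the external 250 MHz clock.

```python
from typing import List, Optional


class ShiftRegisterLatchDemux:
    """Behavioral (cycle-level) model of a 1:N deserializer for one bit.

    An N-stage shift register is clocked at the fast rate (4 GHz). Every N
    fast cycles its contents are copied into an N-bit latch, which stands
    for the 250 MHz parallel latch. The counter replaces the slow clock.
    """

    def __init__(self, n: int = 16) -> None:
        self.n = n
        self.shift = [0] * n   # stage 0 holds the newest bit
        self.latch = [0] * n   # parallel outputs, held between slow edges
        self.count = 0

    def clock(self, bit: int) -> Optional[List[int]]:
        """Apply one fast-clock edge; return a new word on a slow-clock edge."""
        self.shift = [bit] + self.shift[:-1]
        self.count += 1
        if self.count == self.n:
            self.count = 0
            # Present the oldest bit on output 0 (an arbitrary convention).
            self.latch = self.shift[::-1]
            return list(self.latch)
        return None


demux = ShiftRegisterLatchDemux(n=4)
serial = [1, 0, 1, 1, 0, 0, 1, 0]
words = [w for w in (demux.clock(b) for b in serial) if w is not None]
print(words)  # [[1, 0, 1, 1], [0, 0, 1, 0]]
```

The latch copies all N stages at once. The phase of the slow clock relative to the fast one therefore decides which N consecutive bits form a word. This is why the 250 MHz clock must stay locked to the 4 GHz clock, and why all three bit-slice chips must see the same word boundaries.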
The 4 GHz clock is supplied as a 0 dBm sine wave to avoid problems due to harmonic distortion. A passive splitter distributes this signal to all subunits in the multichip module. Their relative phasing must be set to ensure proper setup and hold times between units. The 250 MHz clock can be distributed in parallel as an LVDS signal, terminated externally after the last demultiplexer chip.

The chip requires 1.2V for the logic core (shift register, latch), and 2.5V for the input/output buffers.

**Technology**

The demux is implemented in the ED02AH process from OMMIC. The main characteristics of the process are summarized in the following table.

<table>
<thead>
<tr>
<th>ED02AH Main Characteristics</th>
</tr>
</thead>
<tbody>
<tr>
<td>Interconnect metal layers:</td>
</tr>
<tr>
<td>Poly layers:</td>
</tr>
<tr>
<td>Special process characteristics:</td>
</tr>
<tr>
<td>Fixed die sizes:</td>
</tr>
<tr>
<td>Available I/O:</td>
</tr>
<tr>
<td>Temperature range:</td>
</tr>
</tbody>
</table>

The process is accessible through CMP (Circuits Multi-Projets), which also serves small-scale production (thousands of units).

A future version in CMOS technology, not yet available, has been considered and should follow in about a year.

**Circuit Topology:**

The demultiplexer is built from master-slave D flip-flops clocked at 4 GHz. Each flip-flop uses 8 NOR gates.
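The following Python sketch shows one way to build a master-slave D flip-flop from NOR gates alone. It evaluates a gate-level netlist until the feedback loops settle. This is a logic-level illustration under stated assumptions. Each latch is assumed to be a gated SR-NOR latch of four NOR gates, which gives eight NOR gates per flip-flop. Two further NOR gates with tied inputs serve as inverters for D and the clock. The actual gate arrangement on the chip may differ, and timing is not modelled.

```python
def nor(a: int, b: int) -> int:
    """Two-input NOR on 0/1 values."""
    return 0 if (a or b) else 1


class NorMasterSlaveDFF:
    """Positive-edge master-slave D flip-flop made only of 2-input NOR gates.

    Each latch: two NOR gates gate the data with the clock, and two
    cross-coupled NOR gates store it (8 NOR gates in total). Two extra NOR
    gates with tied inputs act as inverters for D and CLK (hypothetical
    arrangement). Logic-level sketch, no timing.
    """

    def __init__(self) -> None:
        self.mq, self.mqb = 0, 1   # master latch Q, Q-bar
        self.sq, self.sqb = 0, 1   # slave latch Q, Q-bar

    def settle(self, d: int, clk: int) -> int:
        """Apply inputs, iterate the feedback loops to a fixed point, return Q."""
        d_n = nor(d, d)
        clk_n = nor(clk, clk)
        for _ in range(10):
            prev = (self.mq, self.mqb, self.sq, self.sqb)
            # Master is transparent while clk == 0.
            s_m = nor(d_n, clk)          # d AND NOT clk
            r_m = nor(d, clk)            # NOT d AND NOT clk
            self.mq = nor(r_m, self.mqb)
            self.mqb = nor(s_m, self.mq)
            # Slave is transparent while clk == 1 and copies the master.
            s_s = nor(self.mqb, clk_n)   # mq AND clk
            r_s = nor(self.mq, clk_n)    # NOT mq AND clk
            self.sq = nor(r_s, self.sqb)
            self.sqb = nor(s_s, self.sq)
            if (self.mq, self.mqb, self.sq, self.sqb) == prev:
                return self.sq
        raise RuntimeError("feedback loops did not settle")


ff = NorMasterSlaveDFF()
# (d, clk) pairs: Q changes only when clk goes from 0 to 1.
steps = [(1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]
print([ff.settle(d, clk) for d, clk in steps])  # [0, 1, 1, 1, 0]
```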

Several topologies have been evaluated to implement the NOR function [10]: Source Coupled FET Logic (SCFL), Buffered FET Logic (BFL) and Super Buffered FET Logic (SBFL). Their performance has been evaluated and compared by simulation.

SCFL, BFL and SBFL NOR gates all showed good speed and noise immunity, but the first two showed high power consumption. Moreover, the BFL NOR gate occupies a large area because of its level shifter, which needs at least one diode to make the gate output compatible with the input of the following gate. SBFL was the only topology able to meet both the low-power and small-area specifications while also providing high speed.

In addition, the SBFL gate output needs no buffer or level shifter. It is fully compatible with the input of a gate built in the same technology ($V_{OH} = V_{IH}$ and $V_{OL} = V_{IL}$). This translates into a simple layout, without the diode required in SCFL or BFL, and into a significant reduction in power consumption. The only drawback of this topology is the use of both enhancement and depletion devices in the same circuit.

The schematic of the SBFL NOR gate is shown in Fig. 3. The supply voltage $V_{DD}$ is 1.2 V, and the gate uses positive logic with $V_{OH} = 0.7$ V and $V_{OL} = 0$ V.

Simulation Results:

As mentioned above, many simulations have been carried out to evaluate the behavior of both the whole system and the elementary circuits. So far, a single flip-flop and a complete 1:8 demultiplexer have been simulated. Simulations were carried out with a SPICE time-domain simulator and the OMMIC ED02AH library.

Simulations show the following figures for the master-slave D flip-flop implemented with SBFL NOR gates:

- propagation delay: 110-120 ps
- setup time: 80 ps
- hold time: -30 ps

These figures allow a maximum clock frequency of 5 GHz. With a 4 GHz clock the flip-flop consumes about 7 mW.
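The 5 GHz figure is consistent with the usual register-to-register timing constraint. That constraint requires the clock period to cover the clock-to-output delay plus the setup time of the next stage, and a shift register has no logic between stages. The sketch below is a minimal Python check that assumes this constraint and neglects wire delays. It also derives the setup slack at 4 GHz and the hold margin available for clock skew between adjacent stages. The function names are illustrative.

```python
def max_clock_ghz(t_cq_max_ps: float, t_setup_ps: float, t_logic_ps: float = 0.0) -> float:
    """Highest clock frequency for a register-to-register path (setup constraint)."""
    return 1000.0 / (t_cq_max_ps + t_logic_ps + t_setup_ps)


def setup_slack_ps(f_clk_ghz: float, t_cq_max_ps: float, t_setup_ps: float) -> float:
    """Time left over in one clock period after clock-to-output and setup."""
    return 1000.0 / f_clk_ghz - (t_cq_max_ps + t_setup_ps)


def hold_margin_ps(t_cq_min_ps: float, t_hold_ps: float) -> float:
    """Largest skew (capture clock arriving late) the hold constraint tolerates."""
    return t_cq_min_ps - t_hold_ps


# Simulated flip-flop figures quoted above.
print(max_clock_ghz(120, 80))          # 5.0 (GHz)
print(setup_slack_ps(4.0, 120, 80))    # 50.0 (ps at 4 GHz)
print(hold_margin_ps(110, -30))        # 140 (ps)
```

The negative hold time makes the shift register tolerant of clock skew between stages. At 4 GHz the setup side leaves only about 50 ps of margin.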

Fig. 3 – SBFL NOR gate.

Fig. 4 – 1:8 demux simulation results.

A complete set of simulations of the 1:8 demux with a pseudorandom input sequence has been carried out. The results, shown in Fig. 4, confirm correct operation of the whole circuit. The simulations do not yet include the buffers for input and output matching to the LVDS standard, nor the 4 GHz and 250 MHz clock generators.
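A pseudorandom test of this kind can be organized as in the Python sketch below. A PRBS-7 generator (polynomial x^7 + x^6 + 1, period 127) drives the serial input. Slicing the input stream gives the expected 1:8 words, and the captured words are compared with them word by word. The choice of PRBS-7, the bit-to-output ordering and the function names are assumptions made for illustration. The sequence actually used in the SPICE runs is not specified here.

```python
from typing import List, Sequence


def prbs7(n_bits: int, seed: int = 0x7F) -> List[int]:
    """PRBS-7 sequence (x^7 + x^6 + 1) from a 7-bit Fibonacci LFSR."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    bits = []
    for _ in range(n_bits):
        new = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new) & 0x7F
        bits.append(new)
    return bits


def expected_words(bits: Sequence[int], n: int = 8) -> List[List[int]]:
    """Ideal 1:n demux: consecutive groups of n bits, oldest bit first."""
    return [list(bits[i:i + n]) for i in range(0, len(bits) - n + 1, n)]


def count_word_errors(expected: Sequence[Sequence[int]],
                      captured: Sequence[Sequence[int]]) -> int:
    """Number of parallel words that differ from the reference."""
    if len(expected) != len(captured):
        raise ValueError("word counts differ")
    return sum(1 for e, c in zip(expected, captured) if list(e) != list(c))


stimulus = prbs7(127 * 8)              # whole number of periods and of words
reference = expected_words(stimulus)
captured = [w[:] for w in reference]   # stand-in for simulator output
captured[3][5] ^= 1                    # inject one bit error
print(count_word_errors(reference, captured))  # 1
```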

The simulated power consumption of the 1:8 demux is 100 mW, and the estimated consumption of an LVDS buffer is 18 mW. The total power budget for a 1-bit 1:16 demux, whose core is twice the 1:8 one, is therefore (18 × 16) + 200 mW = 488 mW, i.e. about 0.5 W.
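The budget can be written as a small function. This is a rough model. It assumes that the core power scales linearly with the number of outputs, so a 1:16 core takes twice the 100 mW simulated for the 1:8 demux. It also assumes one 18 mW LVDS buffer per output and makes no allowance for input buffers or the clock generator. The function name is illustrative.

```python
def demux_power_mw(n_outputs: int, core_mw_per_8: float = 100.0,
                   lvds_buffer_mw: float = 18.0) -> float:
    """Rough power of a 1-bit 1:n demux: scaled core plus one LVDS buffer per output."""
    core_mw = core_mw_per_8 * n_outputs / 8
    return core_mw + lvds_buffer_mw * n_outputs


print(demux_power_mw(16))        # 488.0 mW, 1-bit 1:16 demux
print(demux_power_mw(8))         # 244.0 mW, 1-bit 1:8 demux
print(3 * demux_power_mw(16))    # 1464.0 mW, three bits
```

Under this model a 3-bit 1:16 demux needs about 1.5 W. The 1:8 variant comes to roughly half, in line with the halving mentioned for the alternate design.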

**Layout:**

The layout of the NOR gate is shown in Fig. 5. It has been designed using the Ohmic Electrode Sharing Technology (OEST) to minimize area. The elementary cell measures 100 µm × 70 µm.

![Fig. 5 – NOR Gate Layout](image)

Based on the flip-flop floorplan shown in Fig. 6, the area of each flip-flop is estimated at 250 µm × 350 µm.

![Fig. 6 - Flip Flop layout block diagram.](image)

This results in a die area of 2000 µm × 1400 µm for the shift register/latch, and 3000 µm × 2000 µm for the whole 1:16 demux including clock generator, buffers and level shifters.
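The shift-register/latch figure can be reproduced by assuming 32 flip-flops (16 shift-register stages plus 16 latch bits) and neglecting routing. Then 32 × 250 µm × 350 µm gives exactly the 2.8 mm² of the 2000 µm × 1400 µm block, which matches, for example, an array of 8 × 4 flip-flops. The rest of the 3000 µm × 2000 µm die is then left for the clock generator, buffers and level shifters. A minimal Python check with illustrative names:

```python
FF_W_UM, FF_H_UM = 250, 350      # estimated flip-flop footprint (Fig. 6)
N_FLIP_FLOPS = 16 + 16           # shift register stages + latch bits (assumed)

ff_area_mm2 = FF_W_UM * FF_H_UM / 1e6
core_area_mm2 = N_FLIP_FLOPS * ff_area_mm2
quoted_core_mm2 = 2000 * 1400 / 1e6   # shift register/latch block
quoted_die_mm2 = 3000 * 2000 / 1e6    # complete 1:16 demux

print(f"flip-flop area  : {ff_area_mm2:.4f} mm^2")
print(f"32 flip-flops   : {core_area_mm2:.2f} mm^2 (quoted {quoted_core_mm2:.2f})")
print(f"remaining area  : {quoted_die_mm2 - quoted_core_mm2:.2f} mm^2")
```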
Control FPGA

The output FPGA performs several tasks:

- Synchronizes the samples from the demultiplexer to a fixed-phase clock, derived internally from the 125 MHz system clock.
- Computes statistics (mean, variance) of the sampled data and makes them available through the standard interface (SPI, AMBSI2 bus)
- Uses these data to adjust the sampler thresholds for symmetry/amplitude
- Drives the relatively long cable to the formatter unit

A single Virtex-II FPGA seems adequate for these tasks. The SPI interface uses a standard macro developed by the ALMA team. The statistics analyzer computes $x$ and $x^2$ for each parallel sample using lookup tables and integrates them with ripple counters. On every tick of the 48 ms clock, the counts are latched and summed together. These values can be compared with the theoretical ones and used to dynamically adjust the sampler thresholds.
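The accumulation can be sketched in Python as follows. The level assignment of the 3-bit codes (odd integers from -7 to +7), the function names and the software loop are assumptions for illustration. In the FPGA, the two lookup tables feed counters for each parallel output rather than a single running sum. The sketch also sizes the x² accumulator for one 48 ms interval.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical mapping of 3-bit codes 0..7 to the odd levels -7, -5, ..., +7.
LEVEL = [2 * code - 7 for code in range(8)]
# Second lookup table: squared level, so the variance needs no multiplier.
LEVEL_SQ = [v * v for v in LEVEL]


def accumulate(words: Iterable[Sequence[int]]) -> Tuple[int, int, int]:
    """Return sample count, sum of x and sum of x^2 over words of 3-bit codes."""
    n = s1 = s2 = 0
    for word in words:            # one word = 16 parallel samples at 250 MHz
        for code in word:
            s1 += LEVEL[code]
            s2 += LEVEL_SQ[code]
            n += 1
    return n, s1, s2


def mean_variance(n: int, s1: int, s2: int) -> Tuple[float, float]:
    """Mean and (population) variance from the latched sums."""
    mean = s1 / n
    return mean, s2 / n - mean * mean


# Width needed for the x^2 accumulator over one 48 ms interval at 4 GS/s.
samples_per_interval = 4_000_000_000 * 48 // 1000
sq_bits = (max(LEVEL_SQ) * samples_per_interval).bit_length()

print(mean_variance(*accumulate([[7] * 16, [0] * 16])))  # (0.0, 49.0)
print(sq_bits)                                            # 34
```

A non-zero mean would point to a threshold offset, and a variance away from the expected value to a gain error. Either one can drive the threshold adjustment.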

Alternate design

At the moment, we are considering, besides the standard architecture of a direct 1:16 demultiplexer, a 1:8 demultiplexer followed by a final 1:2 stage in the FPGA. This option reduces the costs of the custom chips and the interconnection problems, at the expense of tighter speed requirements for the FPGA.

A 1:8 demultiplexer requires a 500 MHz clock instead of 250 MHz. In the FPGA, a double data rate (DDR) input structure clocked at 250 MHz offers a simple way to implement the 1:2 stage (Fig. 7). The FPGA can generate this clock internally, using one of its built-in digital clock managers (DCMs) locked to the 125 MHz system clock. With this architecture, however, timing margins become very tight, so some testing is required.

![Fig. 7: Two stage demultiplexer: 1:8 GaAs chip followed by a 1:2 demux in the FPGA](image)
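The two-stage scheme can be checked at the behavioral level. A DDR register captures consecutive 8-sample words on its rising and falling edges. Pairing those words reproduces the output of a direct 1:16 demux. The Python sketch below assumes ideal capture (no timing errors) and an agreed pairing, i.e. that the first word of each pair lands on the rising edge. The function names are illustrative.

```python
from typing import List, Sequence


def unit_interval_ns(word_rate_mhz: float) -> float:
    """Duration of one word at the FPGA input."""
    return 1000.0 / word_rate_mhz


def ddr_1to2(words8: Sequence[Sequence[int]]) -> List[List[int]]:
    """Merge pairs of 8-sample words (500 MHz) into 16-sample words (250 MHz).

    Even-indexed words model the rising-edge capture registers, odd-indexed
    words the falling-edge ones, of a 250 MHz DDR input stage.
    """
    rising = words8[0::2]
    falling = words8[1::2]
    return [list(r) + list(f) for r, f in zip(rising, falling)]


serial = list(range(32))                               # sample indices, oldest first
words8 = [serial[i:i + 8] for i in range(0, 32, 8)]    # ideal 1:8 chip output
words16 = ddr_1to2(words8)
assert words16 == [serial[0:16], serial[16:32]]        # same as a direct 1:16 demux
print(unit_interval_ns(500))                           # 2.0 (ns per word)
```

At 500 MHz each word is valid for only 2 ns at the FPGA pins. That window has to absorb the chip output skew, the FPGA setup and hold times and clock jitter, which is why the margins are critical. A pairing that starts on the wrong edge would shift the 16-sample words by eight samples, so the word alignment would also need to be fixed at start-up.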

Costs and time schedule

Based on the layout considerations above, the whole 1-bit demultiplexer should fit in a 2 mm × 3 mm die. The cost of prototyping 10 dice (unbonded) is 7128 Euro (VAT included). Small-scale production reduces the cost by roughly 40%, to about 430 Euro per chip. The approximate cost of the 3-bit demux, including packaging, is around 1500 Euro.
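The per-chip figures follow from simple arithmetic, assuming the 40% reduction applies to the per-die price of the prototype run. A minimal Python check with an illustrative function name:

```python
def per_die_cost_eur(run_cost_eur: float, n_dice: int, reduction: float = 0.0) -> float:
    """Cost per die for a run, optionally reduced by a fractional discount."""
    return run_cost_eur / n_dice * (1.0 - reduction)


prototype = per_die_cost_eur(7128, 10)            # per prototype die
production = per_die_cost_eur(7128, 10, 0.40)     # small-scale production
three_bit = 3 * production                        # bare dice for a 3-bit demux
print(f"{prototype:.1f} {production:.0f} {three_bit:.0f}")
```

Three dice at about 430 Euro come to roughly 1290 Euro. The 1500 Euro estimate therefore implies roughly 200 Euro for packaging.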

The cost for the multichip module, the PCB and the control FPGA has not been evaluated yet.

If the 1:8 demux architecture is used, a smaller chip (1x1.5 mm) can be used, and the cost (and power consumption) is halved. The increase in FPGA complexity is marginal, and the reduced pin count may cause a further cost reduction.

The next foundry run for this process has a submission deadline in September 2002, with delivery expected around April 2003.
References:


[3] Barnes Z., “Microprocessor and CAN Interface”, contribution to the ALMA Backend PDR meeting, Granada, April 24-25, 2002.

[9] Mayvial J.-Y., “Proposal for an all-commercial devices 4 GHz triple 1:16 demux”, IRAM internal note, 29/05/2002 (mayvial@iram.fr).

[13] Torres M., “Closed loop automatic adjustment of eye diagram”, contribution to the ALMA Backend PDR meeting, Granada, April 24-25, 2002.