SDH transmission fault handling analysis

Routine maintenance of a transmission system often requires us to locate and clear faults promptly. The key is to narrow the fault down to a specific board and then clear it. This requires a clear understanding of fault causes, the troubleshooting approach, and the handling methods, so that the work yields the most result for the least effort.


Basic principles of transmission fault location

Transmission fault location should generally follow these principles: restore service first, then repair; external first, then transmission; station first, then board; line first, then tributary; high level first, then low level.

1. Restore service first, then repair

When a fault occurs, the system maintainer must first restore the service and then repair the fault. This applies to transmission network alarms that affect services, such as LOS (loss of signal) alarms on a 2 Mbit/s service channel, no-received-light alarms caused by the external line, and UnitFailure (unit failure) alarms caused by board failures. For faults of this kind, the service must be restored first.

However, restoring service first has a prerequisite: the network must have spare channel resources between the same end points as the failed channel, or a spare board of the same type as the failed board.

2. External first, then transmission

When dealing with a fault, first rule out possible external factors, such as fiber breaks, terminal equipment faults, and faults in the power supply or equipment room environment, and then look for causes within the transmission system. Alarms that may be caused by external factors, such as equipment temperature alarms, optical path alarms, and network element failure alarms, must also be handled according to this principle.

3. Station first, then board

When looking for the cause of a transmission equipment fault, locate the station first and then the board.

Generally, when equipment fails, alarms are reported not at a single station but at many stations at the same time. It is then necessary to narrow the scope through analysis and judgment, determine quickly and accurately which station has the problem, and then narrow the fault down within that station to a single board. For example, when handling optical path bit errors or abnormal optical power, analyze the alarms and performance events together with the service signal flow. The loopback method, replacement method, data analysis method, and instrument test method can be used to determine the cause of the alarms and locate the fault to a single board.

4. Line first, then tributary

When dealing with a fault, if a large number of AIS alarms appear on the tributary boards, first rule out a line board fault and then check the tributary boards.

A line board fault often causes abnormal alarms on the tributary boards. When handling alarms, work through the network management alarms in the order "line first, then tributary": first check whether the line board has a LOS alarm or other abnormal alarms, and then check the tributary board alarms.

5. High level first, then low level

When performing alarm analysis, first analyze high-level alarms and then low-level alarms.

Especially when high-level and low-level alarms exist at the same time, first analyze the high-level alarms, such as critical and major alarms, and then the low-level alarms, such as minor and warning alarms. When handling alarms, the system maintainer first handles those that affect services. If such alarms, for example AIS and LOP, are caused by a higher-level alarm such as LOS, the LOS alarm must be handled first.
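
The ordering rules above can be sketched in a few lines of Python. The alarm records, the severity ranks, and the cause table below are illustrative assumptions, not taken from any vendor's network management system. The sketch sorts line board alarms ahead of tributary board alarms and higher severities ahead of lower ones, and marks AIS or LOP alarms as probable consequences when a LOS alarm is present on the same network element.

```python
from dataclasses import dataclass

# Hypothetical severity ranking; a smaller rank is handled first.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "warning": 3}
# Hypothetical board ordering: line boards before tributary boards.
BOARD_RANK = {"line": 0, "tributary": 1}
# Assumed cause table: the alarm on the left may be a consequence of those on the right.
DERIVED_FROM = {"AIS": {"LOS"}, "LOP": {"LOS"}}


@dataclass(frozen=True)
class Alarm:
    ne: str        # network element name (illustrative)
    board: str     # "line" or "tributary"
    name: str      # e.g. "LOS", "AIS", "LOP"
    severity: str  # a key of SEVERITY_RANK


def triage(alarms):
    """Return (alarm, is_probable_consequence) pairs in handling order."""
    present = {(a.ne, a.name) for a in alarms}
    ordered = sorted(
        alarms,
        key=lambda a: (BOARD_RANK[a.board], SEVERITY_RANK[a.severity]),
    )
    result = []
    for a in ordered:
        causes = DERIVED_FROM.get(a.name, set())
        derived = any((a.ne, c) in present for c in causes)
        result.append((a, derived))
    return result


# Invented sample data for the example.
alarms = [
    Alarm("NE2", "tributary", "AIS", "major"),
    Alarm("NE2", "line", "LOS", "critical"),
    Alarm("NE2", "tributary", "LOP", "minor"),
]
for alarm, derived in triage(alarms):
    note = "probable consequence of LOS" if derived else "candidate root cause"
    print(alarm.ne, alarm.board, alarm.name, alarm.severity, "-", note)
```

In this toy data set the LOS alarm on the line board comes out first, and the tributary AIS and LOP alarms are listed after it as probable consequences. A real alarm correlation would also need the service signal flow, which this sketch leaves out.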

Causes of transmission failure

Transmission faults have many causes. By source, they can be roughly divided into engineering construction defects, improper routine maintenance operations, equipment interconnection faults, causes external to the equipment, and faults of the equipment itself.

1. Non-standard construction and poor workmanship

Some of these faults show up during construction; others appear only after the equipment has been running for some time or under certain external conditions, leaving hidden risks to stable operation. To prevent such faults, construction personnel need to follow the engineering specifications strictly during construction and installation, and carry out single-station and network-wide commissioning and testing carefully according to the specifications.

2. Improper routine maintenance operations

Without a thorough understanding of the system, maintenance personnel may be unclear about the details, performance characteristics, and precautions for specific equipment, and about the differences between old and new equipment and between old and new versions. Such faults are most likely during upgrades and capacity expansion, when old and new equipment or versions are mixed, when spare boards of a new version are used, and when boards that have not been commissioned in the system are used.

3. Equipment interconnection faults

Because transmission services are diverse and different services place complex performance requirements on the transmission channel, interconnecting transmission equipment with other equipment is complicated. Problems that can cause faults include cable connection errors, equipment grounding that does not meet requirements, clock synchronization anomalies between the transmission and switching networks, and differences in how overhead bytes in the SDH frame structure are defined.

4. External causes

Faults outside the equipment can also cause transmission faults. External causes include: (1) power system and supporting facility faults, such as AC power failure, DC power failure, fuse failure, low supply voltage, poor grounding, and a deteriorating equipment room environment; (2) optical cable and fiber faults, such as a cut optical cable, excessive cable loss, a broken pigtail, a pigtail bent below its minimum bending radius, dust in a flange connector, and a dirty pigtail end face; (3) cable faults, such as a broken 2 Mbit/s cable, a 2 Mbit/s input or output connector that has fallen off, and poor contact caused by a loose connection; (4) switch faults.

5. Faults of the equipment itself

This means that the equipment itself is damaged or that boards do not work together properly. Common cases are: (1) board failure, such as damage to a line board, 2 Mbit/s board, clock board, cross-connect board, main control board, or other component; (2) network management system faults, including ECC channel interruption and crashes caused by a faulty network cable or a system anomaly between the NMS and the equipment.

Note that boards age naturally after the equipment has run for a long time, and faults caused by aging also belong to this category. Aging faults share common features: the equipment has been in service for a long time and was basically normal before the fault, and the fault appears only at individual stations or boards, or under certain external conditions.


Troubleshooting ideas

When a fault occurs, the system maintainer should stay calm, check the fault symptoms carefully, and analyze the possible causes, so that the response is targeted and fast. Troubleshooting should generally follow the sequence "look first, then ask, then think, and finally act".

On arriving at the site, the system maintainer first looks at the fault symptoms, including where the fault is, which alarms are present, how severe the fault is, and what damage it has caused, in order to understand the nature of the fault.

After looking at the symptoms, the system maintainer should ask the on-site personnel what may have caused them: whether anyone has modified data, deleted files, or replaced a board; whether there has been a power failure or lightning strike; and whether there has been any improper operation.

Based on these findings, the system maintainer applies his or her knowledge to think through and analyze which causes could produce such a fault, and reaches a sound judgment. Finally, the maintainer finds the fault point according to the fault location principles and clears the fault, for example by modifying data or replacing a board.

Common transmission fault handling methods

Common transmission fault handling methods include the observation and analysis method, loopback test method, plug-in method, replacement method, configuration data analysis method, configuration change method, instrument test method, and empirical processing method.

1. Observation and analysis method

System faults are usually accompanied by alarm information. Observing the alarm indicators makes it possible to spot a fault promptly. When a fault occurs, the NMS also records a wealth of alarm events and performance data. Analyzing this information, together with the overhead bytes in the SDH frame structure and the SDH alarm mechanism, gives an initial determination of the fault type and the location of the fault point.

2. Loopback test method

Sometimes the observation and analysis method cannot solve the problem, for example when the networking, services, and fault information are complicated, or in special cases where there are no obvious alarms or performance events. System maintainers can then use the maintenance functions provided by the network management system to test for the point and type of fault. The most common such function is loopback.

Loopback is the most effective and most commonly used method for locating fault points, and it does not require in-depth analysis of alarms and performance data. Its disadvantage is that it affects services, so it is generally used when traffic is low.
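
The way loopbacks narrow down a fault can be sketched as follows. This is a minimal illustration under stated assumptions: the loopback points are visited in order, starting nearest the test end, and `loopback_ok` is a hypothetical callback standing in for the manual step of setting a loopback and checking whether the test signal returns clean. The station and board names are invented.

```python
from typing import Callable, Optional, Sequence, Tuple


def locate_faulty_section(
    points: Sequence[str],
    loopback_ok: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Walk outward from the test end, looping back at each point in turn.

    `points` lists loopback points in order along the service path, starting
    nearest the test end. `loopback_ok(point)` is a hypothetical callback
    that returns True when the test signal comes back clean with the
    loopback set at `point`.

    Returns (last_good, first_bad): the fault lies between those two points.
    `last_good` is "test end" if even the first loopback fails.
    Returns None if every loopback passes.
    """
    last_good = "test end"
    for point in points:
        if not loopback_ok(point):
            return (last_good, point)
        last_good = point
    return None


# Illustrative path and fault position, both invented for the example.
path = [
    "station A tributary",
    "station A line",
    "station B line",
    "station B tributary",
]
faulty_from = "station B line"


def simulated_loopback(point: str) -> bool:
    # Every loopback at or beyond the faulty point fails.
    return path.index(point) < path.index(faulty_from)


print(locate_faulty_section(path, simulated_loopback))
```

Because each loopback interrupts the service it is set on, the number of steps matters in practice; the simple outward walk shown here could be replaced by a bisection over the same list when the path is long.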

3. Plug-in method

When a board is found to be faulty, the system maintainer can clear faults caused by poor contact or an abnormal processor state by unplugging and reinserting the board and its external interface connectors. When doing so, the maintainer should follow the board insertion and removal procedures, so as not to cause other problems or even damage the board.

4. Replacement method

When the plug-in method does not solve the problem, consider the replacement method: replace the component suspected of being faulty with a known-good spare in order to locate and clear the fault.

The replacement method is suitable for ruling out problems in equipment external to the transmission system, such as optical fibers, trunk cables, switches, and power supply equipment. It is also used, once the fault has been located to a single station, to rule out problems with boards in that station. For example, if an optical board at a station reports an alarm and we suspect that the receive and transmit fibers are reversed, we can swap them. If the alarm on the optical board clears after the swap, the fibers were reversed.

The replacement method is simple, makes modest demands on maintenance personnel, and is practical, but it requires spare parts. In addition, boards must be replaced in accordance with the operating procedures.

5. Configuration data analysis method

The configuration data analysis method is a troubleshooting method that supports the diagnosis and handling of alarms by analyzing overhead byte configuration and status and changes in cross-connections.

The advantages of the configuration data analysis method are that it does not affect services, requires no instruments, can correctly identify hardware misconnections, and is efficient. However, fault location takes relatively long and the demands on maintenance personnel are very high; generally, only maintenance personnel who are very familiar with the equipment and very experienced can use it. When using this method, choose trace bytes and other status bytes that do not affect services, such as J0/J1/V3.
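
One way trace bytes help identify a misconnection is by comparing the trace string each port expects with the one it actually receives. The sketch below is only an illustration of that comparison: the port names and trace strings are invented, and reading the values from real equipment is assumed to have been done beforehand through the network management system.

```python
def find_trace_mismatches(expected, received):
    """Compare expected and received trace identifiers per port.

    Both arguments map a port name to a trace string (for example a J0 or
    J1 value). Returns a dict of port -> (expected, received, hint), where
    hint names another port whose expected value matches what this port
    actually received, or None if there is no such port.
    """
    owner_of = {value: port for port, value in expected.items()}
    mismatches = {}
    for port, want in expected.items():
        got = received.get(port)
        if got != want:
            hint = owner_of.get(got)
            mismatches[port] = (want, got, hint)
    return mismatches


# Invented port names and trace strings, for illustration only.
expected = {
    "slot5-port1": "NE1-WEST",
    "slot6-port1": "NE3-EAST",
    "slot7-port1": "NE4-EAST",
}
received = {
    "slot5-port1": "NE3-EAST",
    "slot6-port1": "NE1-WEST",
    "slot7-port1": "NE4-EAST",
}

for port, (want, got, hint) in find_trace_mismatches(expected, received).items():
    print(f"{port}: expected {want!r}, received {got!r}, matches what {hint} expects")
```

In this made-up example the two mismatching ports each receive the string the other expects, which is the pattern a pair of swapped fibers would produce.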

6. Configuration change method

The configuration change method reconfigures time slots, board slots, and board parameters. It is suitable, once the fault has been located to a single station, for clearing faults caused by configuration errors.

When the fault cannot be located accurately by changing the time slot configuration, it must be located further with the replacement method. The configuration change method is therefore suited to determining the fault type when no spare board is available, and to restoring services temporarily over other service channels or board slots.

This method is relatively complicated to operate and demands a high level of skill from maintenance personnel. It is therefore generally not recommended, except to restore services temporarily when no spare board is available or to locate pointer justification problems.

7. Instrument test method

The instrument test method is generally used to rule out problems external to the transmission equipment and interconnection problems with other equipment. Commonly used test instruments for transmission equipment include the 2 Mbit/s bit error tester, the SDH tester, and the spectrum analyzer.

Locating a fault by instrument test is more accurate. The disadvantages are that the instruments must be available and that the demands on maintenance personnel are higher.

8. Empirical processing method

In some special cases, such as an abnormal power supply, low voltage, or strong external electromagnetic interference, some boards of the transmission equipment enter an abnormal working state. The resulting symptoms, such as service interruption or ECC communication interruption, may or may not be accompanied by alarms, and the configuration data of every board may check out as completely normal. Experience shows that in such cases the system maintainer can clear the fault and restore service promptly by resetting the board, power-cycling the NE, re-delivering the configuration, or switching the service to the backup channel.

System maintainers are advised to use this method as little as possible, because it does not help to establish the cause of the fault thoroughly. Unless the situation is urgent, try the methods described above or request technical support through the proper channels, and locate the fault as precisely as possible so as to eliminate hidden risks inside and outside the equipment.

Typical case analysis

To illustrate the fault handling ideas and methods above, several typical cases are analyzed below.

1. 2 Mbit/s cable fault causes service interruption

Fault symptom: The 2 Mbit/s service of a network element is interrupted. The transmission equipment is a Huawei 155/622H, and it reports a T-LOS alarm.

Fault analysis: Because the alarm on the transmission equipment is a T-LOS alarm, the system maintainer can be sure that the optical path of the transmission equipment is not at fault. The fault lies on the 2 Mbit/s circuit between the transmission equipment and the network element, so the problem may be in the SP1D electrical interface board of the transmission equipment or in the 2 Mbit/s cable.

Fault location and resolution: On arriving at the station, the maintainer first located the fault point by loopback. With a far-end loopback set at the DDF, the transmission equipment still reported the T-LOS alarm, whereas the path was normal with a near-end loopback. This indicated a problem with the 2 Mbit/s cable between the SP1D board and the DDF, which was consistent with the initial analysis. The 2 Mbit/s cable from the SP1D board to the DDF is a prefabricated cable and cannot be repaired, so the faulty cable was swapped with an unused prefabricated 2 Mbit/s cable on the SP1D board; only then did the alarm clear and the 2 Mbit/s service return to normal. To make sure that services can be provisioned normally in future, the system maintainer finally replaced the damaged prefabricated 2 Mbit/s cable so that all 2 Mbit/s cables were in working order.

Conclusion: This is a typical case of service interruption caused by a 2 Mbit/s cable fault. From the symptoms we first judged the fault to be on the 2 Mbit/s circuit, then pinpointed the fault point with the most conventional method, loopback, and restored the service quickly once the fault point was found.

2. Abnormal voltage causes service interruption

Fault symptom: The transmission network of an office consists of four OptiX 2500 devices forming a bidirectional multiplex section protection ring. Network element 1 is the service hub and is connected to the network management computer. One day, the services of network element 3 were interrupted, network element 3 could not be logged in to from the network management system, and network elements 2 and 4 reported R-LOS alarms on the boards facing network element 3.

Fault analysis and resolution: The symptoms suggested that network element 3 had lost power. On reaching network element 3, the system maintainer found that the rack alarm indicators and all board indicators were off. The voltage measured between the -48V and BGND terminals was 0 V. The output voltage measured at the power supply equipment was -53.7 V, but the power supply equipment was reporting an abnormal output voltage alarm.

The system maintainer then suspected an internal short circuit in the transmission equipment. With the subrack power switch in the network element 3 cabinet turned off, the voltage between the -48V and BGND terminals was measured again and found to be -20.39 V. A partial short circuit in the cabinet power box might have been pulling the potential down.

The feed from the power supply equipment to the transmission equipment was then switched off (by pulling the power fuse) and the power input cable of the transmission equipment was disconnected. With the subrack power switch off, the resistance between -48V and BGND measured several thousand ohms, which is normal.

The system maintainer next suspected an excessive voltage drop in the power cable, but the resistance of the -48V and BGND cables measured only a few ohms, which is normal. Measuring the fuse that had been pulled showed that its resistance had reached ten thousand ohms: the cause of the fault was the power fuse.

After a good fuse was fitted and network element 3 was powered on again, everything returned to normal.

Conclusion: The power fuse was damaged but not open-circuit. Because its resistance had become very large, the output voltage appeared normal but the supply could not carry any load. Therefore, when the supply voltage at the transmission equipment is abnormal, the system maintainer must check the power supply equipment as well as the transmission equipment itself.
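
The conclusion above can be checked with a simple voltage divider calculation. The sketch below is an illustration under stated assumptions, not part of the original measurements: it treats the damaged fuse as a series resistance of about 10 kΩ (the case gives "ten thousand ohms"), ignores the input resistance of the meter, works with voltage magnitudes rather than the negative polarity, and uses an invented 10 Ω figure for a heavy load.

```python
def load_voltage(source_v, series_ohms, load_ohms):
    """Voltage across the load in a simple series circuit (voltage divider)."""
    return source_v * load_ohms / (series_ohms + load_ohms)


SOURCE_V = 53.7        # magnitude of the supply output reported in the case
FUSE_OHMS = 10_000.0   # "ten thousand ohms" in the case, treated as approximate

# Load resistance implied by the 20.39 V reading, under the divider model:
#   V_load / V_source = R_load / (R_fuse + R_load)
measured_v = 20.39
implied_load = FUSE_OHMS * measured_v / (SOURCE_V - measured_v)
print(f"implied load with the subrack switch off: {implied_load:.0f} ohms")

# A heavy load (assumed 10 ohms here) collapses the terminal voltage.
heavy_load_v = load_voltage(SOURCE_V, FUSE_OHMS, 10.0)
print(f"terminal voltage under a 10 ohm load: {heavy_load_v:.3f} V")
```

Under these assumptions the 20.39 V reading corresponds to a load of roughly 6 kΩ, which agrees with the "several thousand ohms" measured across the terminals with the subrack switch off. Any realistic operating load pulls the terminal voltage down to practically zero, which matches the 0 V reading taken with the equipment connected.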

3. Optical cable cut by thieves causes line interruption

Fault symptom: Between network elements XXX01 and XXX of a transmission network, the two network elements report R-LOS alarms toward each other, and some network elements on the ring report PS alarms. All network elements can be logged in to, and services on the ring and between the ring and the chain are protected, with no service interruption. The alarms include R-LOS, PS, and TU-AIS on some backup channels.

Cause analysis: No service is interrupted, which shows that ring protection switching is working normally. The two network elements report R-LOS alarms toward each other and both can be logged in to. The preliminary conclusion is that the optical cable of the trunk section is broken or an equipment pigtail is faulty.

The ring is a PP ring whose protection mode is tributary board switching. When the primary channel fails, the tributary board switches to selecting the service received from the other direction, and reports a PS alarm together with a TU-AIS alarm for the backup channel. These are normal alarms.

Handling: The equipment pigtails, connectors, and optical boards were checked and found normal, so the fault was judged to be in the optical cable. An OTDR test showed that the optical cable was broken 1.2 km from network element XXX01. The line was patrolled out to that point, the optical cable was repaired, and the fault was cleared.
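
For readers unfamiliar with how an OTDR arrives at a distance such as the 1.2 km above, the basic relation is that distance equals the speed of light in the fiber multiplied by half the round-trip time of the reflected pulse. The sketch below works this through; the group index of 1.468 is an assumed typical value for single-mode fiber, not a figure from the case, and the real value comes from the cable data sheet.

```python
C_VACUUM_M_PER_S = 299_792_458.0  # speed of light in vacuum


def otdr_distance_m(round_trip_s, group_index=1.468):
    """Distance to a reflection, from the pulse round-trip time.

    group_index is an assumed typical value for single-mode fiber.
    """
    return C_VACUUM_M_PER_S * round_trip_s / (2.0 * group_index)


def round_trip_time_s(distance_m, group_index=1.468):
    """Inverse of otdr_distance_m: round-trip time for a given distance."""
    return 2.0 * group_index * distance_m / C_VACUUM_M_PER_S


t = round_trip_time_s(1200.0)  # the 1.2 km break in the case
print(f"round-trip time: {t * 1e6:.2f} microseconds")
print(f"distance recovered: {otdr_distance_m(t):.1f} m")
```

The distance reported is fiber length, so the point found on the ground by line patrol can differ somewhat from the OTDR figure because of cable slack and routing.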

Conclusion

In routine maintenance of SDH transmission networks, we often meet different faults accompanied by different alarm indications, and sometimes faults with identical alarm indications that look like the same fault but have different causes. Only by seeing through the symptoms to the underlying cause can a fault be located accurately and cleared quickly. This requires us to understand the basic principles of fault location, be clear about the troubleshooting approach, and master the common fault handling methods, so as to respond calmly to abnormal conditions and make routine maintenance more effective.
