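Why the Radio Communication System at Los Angeles International Airport Failed

A software failure: who is to blame?

Author: shrek2099 (怪物Shrek), Board: ITExpress
Subject: Why the Radio Communication System at Los Angeles International Airport Failed
Origin: BBS 水木清华站 (Wed Sep 22 09:41:58 2004), on-site post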

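On September 14, the radio communication system at Los Angeles International
Airport failed for more than three hours, cutting off contact between 800
aircraft and the airport's control center and causing at least five incidents
in which aircraft came too close to one another. The cause was eventually
traced to a crash of a server running Windows 2000. The server crashed because
a Windows 2000 server has to be restarted at least once every 49.7 days to
prevent a data overload. To keep the machine from shutting itself down,
maintenance staff were supposed to reboot it manually every 30 days. A poorly
trained technician forgot to reboot it, so the machine shut itself down without
any warning. Worse still, the server's backup system also failed because of a
software fault, and that is how the incident happened.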

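The airport's system, called the Voice Switching and Control System (VSCS),
originally ran on Unix. Three years ago its computers were switched to
Windows 2000 Advanced Server, but it was soon found that the upgraded system
caused the radio communication system to shut down abnormally, so maintenance
staff had to reboot the computers on a schedule as a stopgap. This time the
scheduled reboot was missed, and the incident followed.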

See: http://www.techworld.com/opsys/news/index.cfm?NewsID=2275

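// -------------------- My comments
Making a system easier to use naturally brings hidden risks for performance and reliability. This is Microsoft's chronic ailment.

But with so many users, so much money and so much feedback, I believe Windows will keep getting better.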

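First, the user's side of the problem

  No software design can satisfy everyone. The father of VB once said that general-purpose software cannot be made well by trying to satisfy 100% of people; the best result comes from making 20% of people 100% satisfied.

As for this major blunder: once the Windows problem had been discovered, the application developers and system integrators could surely have found a way to avoid depending on manual intervention.

Building an HA cluster (high-availability cluster), for example, would have avoided these problems entirely. Simpler still, they could have applied a few patches, or the developers could have written their code to work around the buggy part, and the outcome would not have been so hair-raising.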
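
As one illustration of coding around the limit instead of relying on scheduled reboots, here is a minimal sketch in C. It assumes the application reads a 32-bit millisecond counter that wraps to zero (as GetTickCount() does) and widens it to 64 bits by counting wraps. The names tick64_state and tick64_update are hypothetical, not part of any Windows API, and the sketch only works if it is called at least once per wrap period (about 49.7 days).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper state: not part of any Windows API. */
typedef struct {
    uint32_t last_low;  /* most recent 32-bit reading */
    uint64_t high;      /* completed wraps, already shifted into the upper bits */
} tick64_state;

/*
 * Feed in the current 32-bit tick reading and get back a 64-bit value that
 * keeps increasing across wraps. Assumption: called at least once per wrap
 * period, and from a single thread (no locking is done here).
 */
static uint64_t tick64_update(tick64_state *s, uint32_t now_low)
{
    if (now_low < s->last_low) {
        /* The 32-bit counter went backwards, so it wrapped once. */
        s->high += (uint64_t)1 << 32;
    }
    s->last_low = now_low;
    return s->high | now_low;
}
```

Starting from a zeroed state, readings of 4294967295 and then 5 give 4294967295 and then 4294967301, so the widened value keeps increasing across the wrap.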

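Next, the Windows side of the problem

This really should not have happened. Still, there is something to learn from the 49.7-day issue; after a few experts discussed it, I found it interesting.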

Funny, nowhere in the doc for GetTickCount() [microsoft.com] does it say it is deprecated and not to use it. The only thing it does say is "If you need a higher resolution timer, use a multimedia timer or a high-resolution timer." I don’t know what the program needs since I did not write it nor have I seen the code. Maybe they didn’t need a high-res timer and wanted a tick count for how long the system has been up? I don’t think that is too much to ask from an OS.

The GetSystemTimeAsFileTime() [microsoft.com] function retrieves the current system date and time. The information is in Coordinated Universal Time (UTC) format. It doesn’t tell you how long the system has been up.

Oh, and if MS did not think this is a problem why did they fix it in a WinNT service pack [microsoft.com]? Also, right in that link MS says

    Microsoft has confirmed that this is a problem in Windows NT 4.0 and Windows NT Server 4.0, Terminal Server Edition. This problem was first corrected in Windows NT 4.0 Service Pack 4.0 and Windows NT Server 4.0, Terminal Server Edition Service Pack 4.

MS also didn’t seem to fix it in Win2000 Server and their own engineers got hurt by it, specifically with Rpcss.exe [microsoft.com] which according to MS

    SYMPTOMS

    The Rpcss.exe process consumes 60 percent or more of CPU time, and system performance and network performance are affected. This symptom typically occurs 49.7 days after the server is started.

    CAUSE

    This problem occurs because a call to the GetTickCount timer function causes the function to overflow 49.7 days after the server is started.

If GetTickCount is "deprecated" as you state, why in the world are MS’s own programmers using it in rpcss.exe? According to this site [liutilities.com]

    rpcss.exe is an executable of Microsoft Windows Operating System. It is responsible for Remote Procedure Call services on the local machine. These are public services available to the local network. This program is important for the stable and secure running of your computer and should not be terminated.

Still not convinced and want to apologize for MS? Well here are some more of MS’s software that are affected by it in Windows 2000 servers (what this FAA project is using). Print Spooler Stops Scheduling Print Jobs [microsoft.com]

    The Print Spooler service may stop scheduling print jobs to specific Simple Port Monitor (SPM) ports. Although incoming jobs are queuing into the spooler, print jobs may not start. Note that this symptom occurs 49.7 days after you start the Print Spooler service.

There are a bunch of MS apps affected by this logic flaw [microsoft.com] that has been passed from version to version of MS OSes. If this flaw affected all these MS developers who have far more access to proprietary docs, I don’t see how other developers would not stumble over it as well since they do not have access to the proprietary OS.
http://it.slashdot.org/it/04/09/21/2120203.shtml?tid=128&tid=103&tid=201
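
None of the sources quoted above show the faulty code, so the following is only a guess at the general kind of mistake that can produce symptoms like these. The sketch, in portable C with hypothetical function names, contrasts a timeout check against an absolute deadline (which misbehaves near the wrap) with a check on elapsed time (which tolerates a single wrap).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Fragile: builds an absolute deadline. If start + timeout overflows, the
 * deadline becomes a small number and the check fires far too early; if
 * "now" wraps first, the check does not fire until the counter climbs
 * all the way back up.
 */
static bool expired_fragile(uint32_t now, uint32_t start, uint32_t timeout)
{
    uint32_t deadline = (uint32_t)(start + timeout);
    return now >= deadline;
}

/*
 * Wrap-tolerant: unsigned subtraction is modulo 2^32, so now - start is
 * the true elapsed time as long as fewer than 2^32 ms have passed.
 */
static bool expired_wrap_safe(uint32_t now, uint32_t start, uint32_t timeout)
{
    return (uint32_t)(now - start) >= timeout;
}
```

With start = 0xFFFFFF00 and a 1000 ms timeout, the fragile check reports expiry after only 10 ms because the deadline wraps to a small number; with a 100 ms timeout and a reading of 0x00000200 taken after the wrap, it reports no expiry even though 768 ms have passed. Whether Rpcss.exe or the Print Spooler contained exactly this pattern is not something the quoted articles state.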

The problem with the GetTickCount() function:

GetTickCount

The GetTickCount function retrieves the number of milliseconds that have elapsed since the system was started. It is limited to the resolution of the system timer. To obtain the system timer resolution, use the GetSystemTimeAdjustment function.

DWORD GetTickCount(void);

Parameters

This function has no parameters.

Return Values

The return value is the number of milliseconds that have elapsed since the system was started.

Remarks

The elapsed time is stored as a DWORD value. Therefore, the time will wrap around to zero if the system is run continuously for 49.7 days.

If you need a higher resolution timer, use a multimedia timer or a high-resolution timer.

To obtain the time elapsed since the computer was started, retrieve the System Up Time counter in the performance data in the registry key HKEY_PERFORMANCE_DATA. The value returned is an 8-byte value. For more information, see Performance Monitoring.

Example Code

The following example demonstrates how to use this function to wait for a time interval to pass. Due to the nature of unsigned arithmetic, this code works correctly if the return value wraps one time. If the difference between the two calls to GetTickCount is more than 49.7 days, the return value could wrap more than one time and this code will not work; use the system time instead.

    DWORD dwStart = GetTickCount();

    // Stop if this has taken too long
    if( GetTickCount() - dwStart >= TIMELIMIT )
        Cancel();

Note that TIMELIMIT is defined as the time interval of interest to the application, in milliseconds.
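
To make the remark about the value wrapping one time concrete, here is a small portable C sketch. It does not call GetTickCount(); instead a hypothetical helper, tick_at(), simulates the 32-bit reading at a given true uptime, so the wrap can be reproduced without waiting 49.7 days.

```c
#include <stdint.h>

/* Simulates what a 32-bit millisecond counter reports at a given true uptime. */
static uint32_t tick_at(uint64_t true_uptime_ms)
{
    return (uint32_t)true_uptime_ms;  /* keeps only the low 32 bits */
}

/* Elapsed milliseconds between two 32-bit readings, computed modulo 2^32. */
static uint32_t elapsed_ms(uint32_t start, uint32_t now)
{
    return (uint32_t)(now - start);
}
```

elapsed_ms(tick_at(4294967280), tick_at(4294967312)) gives 32 even though the counter wrapped in between. An interval of 2^32 + 32 ms also gives 32, which is the multi-wrap case the documentation warns about.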

Requirements

Client: Requires Windows XP, Windows 2000 Professional, Windows NT Workstation, Windows Me, Windows 98, or Windows 95.
Server: Requires Windows Server 2003, Windows 2000 Server, or Windows NT Server.
Header: Declared in Winbase.h; include Windows.h.
Library: Use Kernel32.lib.

Maximum DWORD value = 2^32 - 1 = 4294967295 ms; 4294967295 / 1000 / 3600 / 24 ≈ 49.71 days
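
The same figure can be checked in code. This is a small C sketch with a hypothetical helper, days_until_wrap(), that divides the number of distinct values of an n-bit millisecond counter by the number of milliseconds in a day.

```c
#include <stdint.h>

#define MS_PER_DAY (1000.0 * 60.0 * 60.0 * 24.0)  /* 86,400,000 ms */

/*
 * Days until a millisecond counter of the given width wraps to zero.
 * Valid for bits < 64, since the shift below is done in 64 bits.
 */
static double days_until_wrap(unsigned bits)
{
    return (double)((uint64_t)1 << bits) / MS_PER_DAY;
}
```

For a 32-bit counter such as a DWORD, days_until_wrap(32) comes out at roughly 49.71.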

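So naturally it overflows. I admire 康core for remembering this function.

This was entirely avoidable; Microsoft could hardly have foreseen it being used this way.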

wingc
