A beginner's guide to distributed TensorFlow; NetBackup jobs waiting in queue

This morning I found waiting jobs in the NBU Activity Monitor; the job details showed "cannot connect to Mediaserver: XXX".

Translated from the Netflix Tech Blog; original author: Brendan Gregg.


First, check whether the media server itself looks normal. Under the media server entry in Devices, its status was "active for disk" with no tape; normally it should be "active for tape and disk". Had the server stopped recognizing the tape drives?

NBU version: 7.5

Linux Performance Analysis in 60,000 Milliseconds

You log in to a Linux server to investigate a performance problem: what should you check in the first minute?
At Netflix we have a massive EC2 Linux cloud, and numerous performance analysis tools to monitor and investigate its performance, including Atlas for cloud-wide monitoring and Vector for on-demand instance analysis.

While those tools help us solve most issues, we sometimes need to log in to an instance and run some standard Linux performance tools.

In this post, the Netflix Performance Engineering team shows how to run a thorough performance check in the first 60 seconds, using standard Linux command-line tools.

Reposted from Jiqizhixin (Synced).

Logged in to the media server. Nothing abnormal at the OS level, and running the vmoprcmd and tpconfig -l commands showed nothing unusual either. Checked the processes with bpps -i MM_all: the ltid process (the robotics daemon) was present and fine. After restarting the services with bpdown -f -v and bpup -f -v, the fault remained.

Media server: Windows Server 2008 R2

First 60 Seconds: Summary

By running the following ten commands, you can get a high-level view of system resource usage and running processes within 60 seconds. Look for errors and saturation metrics first, as they are both easy to interpret, then check resource utilization. Saturation is where a resource has more load than it can handle, and can show up either as the length of a request queue or as time spent waiting.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

(Figure from the original post: perf check path.)

那几个命令须要设置sysstat包。
那么些命令输出的目标,将协助你精晓一些管用的方法:一条龙搜寻质量瓶颈的方法论。这几个命令需求检讨有着财富的利用率、饱和度和错误消息(CPU、内部存款和储蓄器、磁盘等)。同偶然候,当您检查或死灭部分财富的时候,要求注目的在于检查进度中,依据目的数据辅导,稳步减少指标限定。

The following sections walk through these commands with examples from a production environment. For details on each tool, see its man page.

Editor's note: Multi-GPU parallelism can give a good speedup, but one machine supports only a limited number of GPUs, so this article introduces distributed TensorFlow. Distributed TensorFlow lets us run one model across many machines, so training speed can improve markedly. This article briefly surveys the principles and practice of distributed TensorFlow, as an introduction for readers about to dive into distributed training. Unfortunately, the official documentation on distributed TensorFlow is too terse; we need a gentler introduction, running some basic examples through Jupyter.

Backup contents: SQL Server data

1. uptime

$ uptime
23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02

This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run.
On Linux, these numbers include processes wanting to run on a CPU as well as processes blocked in uninterruptible I/O (usually disk I/O).

This gives a high-level idea of resource load (or demand), but cannot be properly understood without other tools. It is worth a quick look only.

The three numbers are exponentially damped moving averages over 1-minute, 5-minute, and 15-minute windows. They tell us how load is changing over time.

For example, if you are checking a problem server and the 1-minute value is much lower than the 15-minute value, you may have logged in too late and missed the issue.

In the example above, the load averages show a recent increase, hitting 30 for the 1-minute value versus 19 for the 15-minute value. Numbers this large mean a lot of something: probably CPU demand; vmstat or mpstat, commands 3 and 4 in this sequence, will confirm.

Keywords: distributed TensorFlow

I found a document that describes how to change a media server's status.

Tape library: IBM 3584

2. dmesg | tail

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.

This prints the last 10 system messages.
Look for errors that can cause performance problems; the example above includes the oom-killer and TCP dropping requests.

PS: this step is really easy to skip, and I have been bitten by it! Besides error-level messages, keep an eye on info-level ones too; they may hold hidden clues.

[Translator's note: oom-killer]
A safety mechanism that keeps Linux from getting into worse trouble when memory runs out, by killing less important processes - sacrificing a pawn to save the king.

Question: what is distributed TensorFlow, and how do you get started with it?

Run on the master server:

While inspecting NBU, I found a failed backup job in the Activity Monitor with status code 52, as shown:

3. vmstat 1

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0
32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0
32 0 0 200890112 73708 591860 0 0 0 0 9501 2154 99 1 0 0 0
32 0 0 200889568 73712 591856 0 0 0 48 11900 2459 99 0 0 0 0
32 0 0 200890208 73712 591860 0 0 0 0 15898 4840 98 1 1 0 0

vmstat is a commonly available tool for a summary of virtual memory state (first created for BSD decades ago). It prints key server statistics on each line.
vmstat was run with an argument of 1, printing one-second summaries.
The first line of output shows averages since boot rather than the previous second.

For now, skip the first line, and let's learn and remember what each column means.

r: the number of processes running on a CPU or waiting for a turn.
This gives a better signal of CPU saturation than the load averages, because it does not include I/O. To interpret: an "r" value greater than the CPU count means saturation.

free: free memory (KB).
If this number is large, you still have plenty of free memory.
The "free -m" command, covered as command 7, shows the state of free memory more clearly.

si, so: swap-ins and swap-outs.
If these are non-zero, you are out of memory.

us, sy, id, wa, st:
These are breakdowns of CPU time, averaged across all CPUs.
They are user time, system time (kernel), idle, wait I/O, and stolen time (by other guests, for example under Xen). Adding user and system time confirms whether the CPUs are busy.

A constant degree of wait I/O points to a disk bottleneck; the CPUs are idle then, because tasks are blocked waiting for disk I/O. You can treat wait I/O as another form of CPU idle, one that hints at why the CPUs are idle.

System time is necessary for I/O processing. A high average system time, over 20%, is worth exploring further: perhaps the kernel is processing I/O inefficiently.

In the example above, CPU time is almost entirely user-level, pointing to application-level usage instead. The CPUs are also over 90% utilized on average. That is not necessarily a problem; check the "r" column for the degree of saturation.

Main text:

nbemmcmd -updatehost -machinename <mediaserver hostname> -machinetype media -machinestateop set_tape_active -masterserver <masterserver hostname>

(Figure 1)

4. mpstat -P ALL 1

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:38:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
07:38:50 PM all 98.47 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 0.78
07:38:50 PM 0 96.04 0.00 2.97 0.00 0.00 0.00 0.00 0.00 0.00 0.99
07:38:50 PM 1 97.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00
07:38:50 PM 2 98.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
07:38:50 PM 3 96.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.03
[...]

This command prints CPU time breakdowns per CPU over time, and is typically used to check for an imbalance.
A single hot CPU can be evidence of a single-threaded application.

Introduction

Problem solved.

In the Device Monitor, that media server's drive showed a missing path:

5. pidstat 1

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:41:02 PM UID PID %usr %system %guest %CPU CPU Command
07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0
07:41:03 PM 0 4214 5.66 5.66 0.00 11.32 15 mesos-slave
07:41:03 PM 0 4354 0.94 0.94 0.00 1.89 8 java
07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 27 java
07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 28 java
07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat
07:41:03 PM UID PID %usr %system %guest %CPU CPU Command
07:41:04 PM 0 4214 6.00 2.00 0.00 8.00 15 mesos-slave
07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 27 java
07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 28 java
07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass
07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat
^C

pidstat is a little like top's per-process view, but it prints a rolling summary instead of clearing the screen.
That makes it very useful, especially for watching patterns over time, and for recording what you saw for further study.
The example above identifies two java processes as responsible for consuming CPU.
"%CPU" is the total across all CPUs; 1591% shows that one java process is consuming almost 16 CPUs.

import tensorflow as tf

This article comes from the "Study Notes" blog; please keep this attribution when reposting.

(Figure 2)

6. iostat -xz 1

$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
73.96 0.00 3.73 0.03 0.06 22.21
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 0.23 0.21 0.18 4.52 2.08 34.37 0.00 9.98 13.80 5.42 2.44 0.09
xvdb 0.01 0.00 1.02 8.94 127.97 598.53 145.79 0.00 0.43 1.78 0.28 0.25 0.25
xvdc 0.01 0.00 1.02 8.86 127.79 595.94 146.50 0.00 0.45 1.82 0.30 0.27 0.26
dm-0 0.00 0.00 0.69 2.32 10.47 31.69 28.01 0.01 3.23 0.71 3.98 0.13 0.04
dm-1 0.00 0.00 0.00 0.94 0.01 3.78 8.00 0.33 345.84 0.04 346.81 0.01 0.00
dm-2 0.00 0.00 0.09 0.07 1.35 0.36 22.50 0.00 2.55 0.23 5.62 1.78 0.03
[...]

This is a great tool for understanding block devices (disks), both for workload characterization and for the resulting performance.

r/s, w/s, rkB/s, wkB/s: the delivered reads, writes, read Kbytes, and write Kbytes per second to the device. Use these to characterize the workload; a performance problem may simply be due to an excessive load applied.

await: the average I/O time (milliseconds).
This is the time the application suffers, as it includes both time queued and time being serviced.
Averages far larger than expected can be an indicator of device saturation or device problems.

avgqu-sz: the average number of requests issued to the device.
Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices fronting multiple back-end disks).

%util: device utilization.
This is really a busy percent, showing the time each second the device was doing work.
Values over 60% typically mean poor performance (which should also be visible in await), although it depends on the device.
Values close to 100% usually indicate saturation.

If the storage device is a logical disk fronting many back-end disks, then 100% utilization may only mean that some I/O is being processed 100% of the time; the back-end disks may be far from saturated and able to handle much more work.

Bear in mind that poorly performing disk I/O is not necessarily an application problem. Many techniques are used to perform I/O asynchronously so that the application does not block and suffer the latency directly (e.g., read-ahead for reads, buffering for writes).

For example, suppose we want two processes to share some common parameters. For simplicity, suppose this is just a single variable:


1 Troubleshooting

7. free -m

$ free -m
total used free shared buffers cached
Mem: 245998 24545 221453 83 59 541
-/+ buffers/cache: 23944 222053
Swap: 0 0 0

buffers: the buffer cache, used for block device I/O.
cached: the page cache, used by file systems.

We just want to check that these are not near zero, which can mean higher disk I/O and worse performance (confirm with iostat).
The example above looks fine, with many Mbytes in each.

"-/+ buffers/cache" gives less confusing values for used and free memory.

Linux uses free memory for caches, but can reclaim it quickly when applications need it.
So cached memory should, in a sense, be counted in the free memory column, which this line does.
There is even a website, linuxatemyram, about this confusion.

It can be additionally confusing if ZFS on Linux is used, as we do for some services: ZFS has its own file system cache that is not reflected properly in the free -m output.

The system can then appear low on free memory when that memory is in fact available, reclaimable from the ZFS cache as needed.

var = tf.Variable(initial_value=0.0)

First, check the drives on the media server:

8. sar -n DEV 1

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:49 AM eth0 18763.00 5032.00 20686.42 478.30 0.00 0.00 0.00 0.00
12:16:49 AM lo 14.00 14.00 1.36 1.36 0.00 0.00 0.00 0.00
12:16:49 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:50 AM eth0 19763.00 5101.00 21999.10 482.56 0.00 0.00 0.00 0.00
12:16:50 AM lo 20.00 20.00 3.25 3.25 0.00 0.00 0.00 0.00
12:16:50 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
^C

Use this tool to check network interface throughput:
rxkB/s and txkB/s, as a measure of workload, and also to check whether any limit has been reached.

In the example above, eth0 receive reaches 22 Mbytes/s, i.e. 176 Mbits/sec (well within, say, a 1 Gbit/sec limit).

This version also has a "%ifutil" column for device utilization (max of both directions for full duplex), which we also measure with Brendan's nicstat tool.
As with nicstat, this value is hard to get right, and it seems not to be working in this example (0.00).

First, we need to create a session for each process. (Assume sess1 is created in one process and sess2 in another.)

C:\Program Files\Veritas\Volmgr\bin>tpconfig -l

Type Num Index Type DrNum Status Comment Name Path

robot 0 - TLD - - - - bcsyyfx

drive - 9 hcart2 8 DOWN - IBM.ULT3580-TD5.009 MISSING_PATH:{4,0,6,0}:00078AD2A

9. sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:17:19 AM active/s passive/s iseg/s oseg/s
12:17:20 AM 1.00 0.00 10233.00 18846.00
12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:20 AM 0.00 0.00 0.00 0.00 0.00
12:17:20 AM active/s passive/s iseg/s oseg/s
12:17:21 AM 1.00 0.00 8359.00 6039.00
12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:21 AM 0.00 0.00 0.00 0.00 0.00
^C

This is a summarized view of some key TCP metrics, including:

active/s: number of locally-initiated TCP connections per second (e.g., via connect()).
passive/s: number of remotely-initiated TCP connections per second (e.g., via accept()).
retrans/s: number of TCP retransmits per second.

The active and passive counts are often useful as a rough measure of server load: the number of newly accepted connections (passive), and the number of downstream connections (active).

It may help to think of active as outbound and passive as inbound, but strictly speaking that is not accurate (consider, for example, a localhost-to-localhost connection).

Retransmits are a sign of a network or server problem; the cause may be an unreliable network (e.g., the public Internet), or a server that is overloaded and dropping packets.

The example above shows just one new TCP connection per second.

sess1 = tf.Session()

能够观察地点{4,0,6,0}连串号00078AD2A的驱动错过了路线,和device monitor彰显同后生可畏;NBU是经过操作系统来赢得tape drive的路径的,所以先在器材微处理机中检查一下drive的情形

10. top

$ top
top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20248 root 20 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java
4213 root 20 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave
66128 titancl+ 20 0 24344 2332 1172 R 1.0 0.0 0:00.07 top
5235 root 20 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java
4299 root 20 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java
1 root 20 0 33620 2920 1496 S 0.0 0.0 0:03.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0
8 root 20 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched

The top command includes many of the metrics we checked earlier.

It is handy to run it to see whether anything looks wildly different from the earlier commands' results, which would indicate that load is variable.

A downside of top is that it is hard to see patterns over time, which may be clearer in tools like vmstat and pidstat that provide rolling output.

Evidence of intermittent issues can also be lost if you do not pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue), since the screen clears.

sess2 = tf.Session()

(Figure 3)

Follow-on Analysis

There are many more commands and methodologies you can apply to drill deeper.

See Brendan's Linux Performance Tools tutorial, which works through over 40 commands covering observability, benchmarking, tuning, static performance tuning, monitoring, and tracing.

Tackling system reliability and performance problems at web scale is one of our core passions.

If you would like to join us in tackling these kinds of challenges, we are hiring!

Brendan Gregg

The original article follows:

sess1.run(tf.global_variables_initializer())

In Device Manager, click "Show hidden devices" and inspect the tape drives: no devices were missing a driver, and all drives were working normally. Checking each drive's properties showed that, in the operating system, the drive with serial number 00078AD2A is at address {4,0,5,0} (port=4 bus=0 target=5 lun=0), which does not match the {4,0,6,0} shown in NBU - no wonder NBU could not find the path. Most likely the tape driver was reinstalled or the bus position changed: the OS-level address changed while NBU was not updated accordingly, so NBU's configuration no longer matched the current system. NBU stores its drive configuration in EMM, so the fix is simply to update the configuration in EMM.

Linux Performance Analysis in 60,000 Milliseconds

You login to a Linux server with a performance issue:
what do you check in the first minute?

At Netflix we have a massive EC2 Linux cloud, and numerous performance analysis tools to monitor and investigate its performance.

These include Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis.

While those tools help us solve most issues, we sometimes need to login to an instance and run some standard Linux performance tools.

In this post, the Netflix Performance Engineering team will show you the first 60 seconds of an optimized performance investigation at the command line, using standard Linux tools you should have available.

sess2.run(tf.global_variables_initializer())

2重新配置dirve path

First 60 Seconds: Summary

In 60 seconds you can get a high level idea of system resource usage and running processes by running the following ten commands.

Look for errors and saturation metrics, as they are both easy to interpret, and then resource utilization.

Saturation is where a resource has more load than it can handle, and can be exposed either as the length of a request queue, or time spent waiting.

uptime  
dmesg | tail vmstat 1  
mpstat -P ALL 1 pidstat 1  
iostat -xz 1 free -m  
sar -n DEV 1  
sar -n TCP,ETCP 1 top

Some of these commands require the sysstat package installed.
The metrics these commands expose will help you complete some of the USE Method: a methodology for locating performance bottlenecks.

This involves checking utilization, saturation, and error metrics for all resources (CPUs, memory, disks, e.t.c.).

Also pay attention to when you have checked and exonerated a resource, as by process of elimination this narrows the targets to study, and directs any follow on investigation.

The following sections summarize these commands, with examples from a production system. For more information about these tools, see their man pages.

Each call to tf.Session() creates a separate "execution engine" and connects the session handle to it. The execution engine is what actually stores variable values and runs operations. More precisely, since Python is object-oriented, tf.Session() is a TensorFlow class whose instance opens a session that runs the computation graph.

First, delete the incorrect drive path using the tpconfig command:

1. uptime

$ uptime  
23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02

This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux systems, these numbers include processes wanting to run on CPU, as well as processes blocked in uninterruptible I/O (usually disk I/O).

This gives a high level idea of resource load (or demand), but can’t be properly understood without other tools.

Worth a quick look only.

The three numbers are exponentially damped moving sum averages with a 1 minute, 5 minute, and 15 minute constant.
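To make the "exponentially damped" part concrete, here is a small illustrative sketch (not from the original post) of how such an average behaves. The real kernel uses fixed-point arithmetic and a 5-second sampling tick, so this is only an approximation of the idea, not the kernel's implementation:

```python
import math

def update_load(old_load, runnable, interval=5.0, window=60.0):
    """One step of an exponentially damped average:
    load = load * e^(-interval/window) + runnable * (1 - e^(-interval/window))."""
    decay = math.exp(-interval / window)
    return old_load * decay + runnable * (1.0 - decay)

# With a constant 2 runnable tasks, the 1-minute average converges toward 2.0,
# which is why a sudden spike shows up in the 1-minute value first.
load = 0.0
for _ in range(100):  # 100 samples, 5 s apart
    load = update_load(load, 2, interval=5.0, window=60.0)
print(round(load, 2))
```

The 5- and 15-minute averages use the same update with a larger `window`, which is why they lag behind the 1-minute value during a spike.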

The three numbers give us some idea of how load is changing over time.

For example, if you’ve been asked to check a problem server, and the 1 minute value is much lower than the 15 minute value, then you might have logged in too late and missed the issue.

In the example above, the load averages show a recent increase, hitting 30 for the 1 minute value, compared to 19 for the 15 minute value.

That the numbers are this large means a lot of something: probably CPU demand; vmstat or mpstat will confirm, which are commands 3 and 4 in this sequence.
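If you want the same three numbers programmatically (a convenience sketch, not part of the original post), Python's standard library exposes them via os.getloadavg, which reads /proc/loadavg on Linux:

```python
import os

# The same three numbers uptime prints: 1-, 5-, and 15-minute load averages.
one, five, fifteen = os.getloadavg()

# A recent spike shows up as the 1-minute value pulling ahead of the 15-minute one.
if one > fifteen:
    print(f"load is rising: {one:.2f} (1 min) vs {fifteen:.2f} (15 min)")
else:
    print(f"load is flat or falling: {one:.2f} (1 min) vs {fifteen:.2f} (15 min)")
```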

平日,差异进程中的试行引擎是不相干的。在三个对话中改换变量(在一个举办引擎上)不会听得多了就能说的详细此外会话中的变量。

<NBUinstall_path>Volmgrbin>tpconfig -delete -drpath -port 4 -bus 0 -target 6 -lun0 -asciiname IBM.ULT3580-TD5.009

2. dmesg | tail

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-r
ss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP cou
nters.

This views the last 10 system messages, if there are any.

Look for errors that can cause performance issues.

The example above includes the oom­killer, and TCP dropping a request.
Don’t miss this step! dmesg is always worth checking.

print("Initial value of var in session 1:", sess1.run(var))

双重起动NBU服务

3. vmstat 1

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0
32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0
32 0 0 200890112 73708 591860 0 0 0 0 9501 2154 99 1 0 0 0
32 0 0 200889568 73712 591856 0 0 0 48 11900 2459 99 0 0 0 0
32 0 0 200890208 73712 591860 0 0 0 0 15898 4840 98 1 1 0 0

Short for virtual memory stat, vmstat(8) is a commonly available tool (first created for BSD decades ago).

It prints a summary of key server statistics on each line.

vmstat was run with an argument of 1, to print one second summaries.

The first line of output (in this version of vmstat) has some columns that show the average since boot, instead of the previous second.

For now, skip the first line, unless you want to learn and remember which column is which. Columns to check:

r:
Number of processes running on CPU and waiting for a turn.
This provides a better signal than load averages for determining CPU saturation, as it does not include I/O.

To interpret: an “r” value greater than the CPU count is saturation.
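A quick way to apply this rule without running vmstat (an illustrative, Linux-only sketch) is to compare the runnable-task count the kernel reports in /proc/stat against the CPU count; vmstat's "r" column is derived from the same source:

```python
import os

def cpu_saturated():
    """Compare runnable processes (vmstat's "r") with the CPU count.

    Reads the procs_running field from /proc/stat (Linux-only).
    """
    running = 0
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("procs_running"):
                running = int(line.split()[1])
                break
    ncpu = os.cpu_count()
    return running, ncpu, running > ncpu

running, ncpu, saturated = cpu_saturated()
print(f"runnable={running} cpus={ncpu} saturated={saturated}")
```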

free: Free memory in kilobytes.

If there are too many digits to count, you have enough free memory.

The “free -m” command, included as command 7, better explains the state of free memory.

si, so:
Swap-ins and swap-outs. If these are non-zero, you’re out of memory.

us, sy, id, wa, st:
These are breakdowns of CPU time, on average across all CPUs.
They are user time, system time (kernel), idle, wait I/O,
and stolen time (by other guests, or with Xen, the guest's own isolated driver domain).
The CPU time breakdowns will confirm if the CPUs are busy, by adding user + system time.
A constant degree of wait I/O points to a disk bottleneck;
this is where the CPUs are idle, because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why they are idle.

System time is necessary for I/O processing.

A high system time average, over 20%, can be interesting to explore further: perhaps the kernel is processing the I/O inefficiently.

In the above example, CPU time is almost entirely in user-level, pointing to application level usage instead.

The CPUs are also well over 90% utilized on average.
This isn’t necessarily a problem; check for the degree of saturation using the “r” column.

print("Initial value of var in session 2:", sess2.run(var))

<NBU_install_path>\NetBackup\bin>bpdown -f -v

<NBU_install_path>\NetBackup\bin>bpup -f -v

4. mpstat -P ALL 1

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:38:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
07:38:50 PM all 98.47 0.00 0.75 0.00 0.00 0.00 0.00 0.00 0.00 0.78
07:38:50 PM 0 96.04 0.00 2.97 0.00 0.00 0.00 0.00 0.00 0.00 0.99
07:38:50 PM 1 97.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00
07:38:50 PM 2 98.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
07:38:50 PM 3 96.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.03
[...]

This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance.

A single hot CPU can be evidence of a single-threaded application.

sess1.run(var.assign_add(1.0))

Check the drive status:

5. pidstat 1

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
07:41:02 PM UID PID %usr %system %guest %CPU CPU Command
07:41:03 PM 0 9 0.00 0.94 0.00 0.94 1 rcuos/0
07:41:03 PM 0 4214 5.66 5.66 0.00 11.32 15 mesos-slave
07:41:03 PM 0 4354 0.94 0.94 0.00 1.89 8 java
07:41:03 PM 0 6521 1596.23 1.89 0.00 1598.11 27 java
07:41:03 PM 0 6564 1571.70 7.55 0.00 1579.25 28 java
07:41:03 PM 60004 60154 0.94 4.72 0.00 5.66 9 pidstat
07:41:03 PM UID PID %usr %system %guest %CPU CPU Command
07:41:04 PM 0 4214 6.00 2.00 0.00 8.00 15 mesos-slave
07:41:04 PM 0 6521 1590.00 1.00 0.00 1591.00 27 java
07:41:04 PM 0 6564 1573.00 10.00 0.00 1583.00 28 java
07:41:04 PM 108 6718 1.00 0.00 0.00 1.00 0 snmp-pass
07:41:04 PM 60004 60154 1.00 4.00 0.00 5.00 9 pidstat
^C

Pidstat is a little like top's per-process summary, but prints a rolling summary instead of clearing the screen.

This can be useful for watching patterns over time, and also recording what you saw (copy-n-paste) into a record of your investigation.

The above example identifies two java processes as responsible for consuming CPU.

The %CPU column is the total across all CPUs; 1591% shows that one java process is consuming almost 16 CPUs.

print("Incremented var in session 1")

<NBU_install_path>\Volmgr\bin>vmoprcmd


6. iostat -xz 1

$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
73.96 0.00 3.73 0.03 0.06 22.21
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 0.23 0.21 0.18 4.52 2.08 34.37 0.00 9.98 13.80 5.42 2.44 0.09
xvdb 0.01 0.00 1.02 8.94 127.97 598.53 145.79 0.00 0.43 1.78 0.28 0.25 0.25
xvdc 0.01 0.00 1.02 8.86 127.79 595.94 146.50 0.00 0.45 1.82 0.30 0.27 0.26
dm-0 0.00 0.00 0.69 2.32 10.47 31.69 28.01 0.01 3.23 0.71 3.98 0.13 0.04
dm-1 0.00 0.00 0.00 0.94 0.01 3.78 8.00 0.33 345.84 0.04 346.81 0.01 0.00
dm-2 0.00 0.00 0.09 0.07 1.35 0.36 22.50 0.00 2.55 0.23 5.62 1.78 0.03
[...]

This is a great tool for understanding block devices (disks), both the workload applied and the resulting performance.
Look for:
r/s, w/s, rkB/s, wkB/s:
These are the delivered reads, writes, read Kbytes, and write Kbytes per second to the device.
Use these for workload characterization.
A performance problem may simply be due to an excessive load applied.
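For a rough workload characterization without iostat, the per-device read and write rates can be approximated from deltas of /proc/diskstats, the same file iostat derives its columns from (an illustrative, Linux-only sketch):

```python
import time

def disk_rw_per_sec(interval=1.0):
    """Approximate r/s and w/s per device from /proc/diskstats deltas (Linux-only),
    similar to iostat's first two workload columns."""
    def snapshot():
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                # fields: major minor name reads_completed ... writes_completed ...
                name, reads, writes = fields[2], int(fields[3]), int(fields[7])
                stats[name] = (reads, writes)
        return stats

    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    return {
        name: ((after[name][0] - r) / interval, (after[name][1] - w) / interval)
        for name, (r, w) in before.items() if name in after
    }

for dev, (rps, wps) in disk_rw_per_sec().items():
    print(f"{dev}: {rps:.1f} r/s {wps:.1f} w/s")
```

This gives completed I/Os per second; latency figures such as await still need the time-accounting fields that iostat computes.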

await:
The average time for the I/O in milliseconds.
This is the time that the application suffers, as it includes both time queued and time being serviced. Larger than expected average times can be an indicator of device saturation, or device problems.

avgqu-sz:
The average number of requests issued to the device. Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices which front multiple back-end disks.)

%util:
Device utilization.
This is really a busy percent, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should be seen in await), although it depends on the device.
Values close to 100% usually indicate saturation.

If the storage device is a logical disk device fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed 100% of the time, however, the back-end disks may be far from saturated, and may be able to handle much more work.

Bear in mind that poor performing disk I/O isn’t necessarily an application issue.

Many techniques are typically used to perform I/O asynchronously, so that the application doesn’t block and suffer the latency directly (e.g., read-ahead for reads, and buffering for writes).

print("Value of var in session 1:", sess1.run(var))

If you change the drive configuration and then restart the NBU services, the drive status becomes "restart", indicating that the process needs to be restarted.

7. free -m

$ free -m
             total       used       free     shared    buffers     cached
Mem:        245998      24545     221453         83         59        541
-/+ buffers/cache:      23944     222053
Swap:            0          0          0

The right two columns show:
buffers: For the buffer cache, used for block device I/O.
cached: For the page cache, used by file systems.

We just want to check that these aren’t near-zero in size, which can lead to higher disk I/O (confirm using iostat), and worse performance. The above example looks fine, with many Mbytes in each.

The “-/+ buffers/cache” provides less confusing values for used and free memory. Linux uses free memory for the caches, but can reclaim it quickly if applications need it. So in a way the cached memory should be included in the free memory column, which this line does. There’s even a website, linuxatemyram, about this confusion.

It can be additionally confusing if ZFS on Linux is used, as we do for some services, as ZFS has its own file system cache that isn’t reflected properly by the free -m columns. It can appear that the system is low on free memory, when that memory is in fact available for use from the ZFS cache as needed.
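The buffers/cached accounting can also be checked directly against /proc/meminfo, which is where free gets its numbers (an illustrative, Linux-only sketch, not part of the original post):

```python
def meminfo_mb():
    """Read the fields behind free -m from /proc/meminfo (Linux-only), in MB."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0]) // 1024  # kB -> MB
    return info

mem = meminfo_mb()
# The "-/+ buffers/cache" view: caches are reclaimable, so count them as available.
effectively_free = mem["MemFree"] + mem["Buffers"] + mem["Cached"]
print("free:", mem["MemFree"], "MB; effectively free:", effectively_free, "MB")
```

Note that, as the paragraph above warns, a ZFS ARC would not show up in these fields either.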

print("Value of var in session 2:", sess2.run(var))

Define the new drive path:

8. sar -n DEV 1

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:16:48 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:49 AM eth0 18763.00 5032.00 20686.42 478.30 0.00 0.00 0.00 0.00
12:16:49 AM lo 14.00 14.00 1.36 1.36 0.00 0.00 0.00 0.00
12:16:49 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:16:49 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
12:16:50 AM eth0 19763.00 5101.00 21999.10 482.56 0.00 0.00 0.00 0.00
12:16:50 AM lo 20.00 20.00 3.25 3.25 0.00 0.00 0.00 0.00
12:16:50 AM docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
^C

Use this tool to check network interface throughput:
rxkB/s and txkB/s, as a measure of workload, and also to check if any limit has been reached.

In the above example, eth0 receive is reaching 22 Mbytes/s, which is 176 Mbits/sec (well under, say, a 1 Gbit/sec limit).

This version also has %ifutil for device utilization (max of both directions for full duplex), which is something we also use Brendan’s nicstat tool to measure. And like with nicstat, this is hard to get right, and seems to not be working in this example (0.00).
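The rxkB/s and txkB/s figures come from the cumulative byte counters in /proc/net/dev; computing them by hand looks roughly like this (an illustrative, Linux-only sketch, not part of the original post):

```python
import time

def iface_throughput_kbps(interval=1.0):
    """rxkB/s and txkB/s per interface from /proc/net/dev deltas (Linux-only),
    the same figures sar -n DEV reports."""
    def snapshot():
        stats = {}
        with open("/proc/net/dev") as f:
            for line in f.readlines()[2:]:  # skip the two header lines
                iface, data = line.split(":", 1)
                fields = data.split()
                stats[iface.strip()] = (int(fields[0]), int(fields[8]))  # rx, tx bytes
        return stats

    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    return {
        iface: ((after[iface][0] - rx) / 1024 / interval,
                (after[iface][1] - tx) / 1024 / interval)
        for iface, (rx, tx) in before.items() if iface in after
    }

for iface, (rx, tx) in iface_throughput_kbps().items():
    print(f"{iface}: {rx:.1f} rxkB/s {tx:.1f} txkB/s")
```

Comparing the result against the interface's link speed gives the utilization estimate that %ifutil attempts.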

The output of the code blocks above is:

On the master server, open the Admin Console and use the "Configure Storage Devices" wizard to configure the drive. After selecting the media server in the wizard I hit a small snag: NBU reported "error connecting to oprd on xxxx: oprd returned an abnormal status (96)" - the master server could not connect to the media server.

9. sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx) 07/14/2015 _x86_64_ (32 CPU)
12:17:19 AM active/s passive/s iseg/s oseg/s
12:17:20 AM 1.00 0.00 10233.00 18846.00
12:17:19 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:20 AM 0.00 0.00 0.00 0.00 0.00
12:17:20 AM active/s passive/s iseg/s oseg/s
12:17:21 AM 1.00 0.00 8359.00 6039.00
12:17:20 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
12:17:21 AM 0.00 0.00 0.00 0.00 0.00
^C

active/s: Number of locally-initiated TCP connections per second (e.g., via connect()).

passive/s: Number of remotely-initiated TCP connections per second (e.g., via accept()).

retrans/s: Number of TCP retransmits per second.

The active and passive counts are often useful as a rough measure of server load: number of new accepted connections (passive), and number of downstream connections (active).

It might help to think of active as outbound, and passive as inbound, but this isn’t strictly true (e.g., consider a localhost to localhost connection).

Retransmits are a sign of a network or server issue; it may be an unreliable network (e.g., the public Internet), or it may be due to a server being overloaded and dropping packets. The example above shows just one new TCP connection per second.
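sar computes these per-second rates from the cumulative counters in /proc/net/snmp; reading the raw counters directly looks roughly like this (an illustrative, Linux-only sketch, not part of the original post):

```python
def tcp_counters():
    """Cumulative TCP counters behind sar -n TCP,ETCP, from /proc/net/snmp
    (Linux-only): active/passive opens and retransmitted segments."""
    with open("/proc/net/snmp") as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith("Tcp:"):  # header row; the values are on the next row
            names = line.split()[1:]
            values = [int(v) for v in lines[i + 1].split()[1:]]
            tcp = dict(zip(names, values))
            return {k: tcp[k] for k in ("ActiveOpens", "PassiveOpens", "RetransSegs")}

print(tcp_counters())
```

Sampling this twice, one second apart, and subtracting gives the active/s, passive/s, and retrans/s rates sar prints.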

Initial value of var in session 1: 0.0

Some googling turned up a document saying the media server must be reconfigured, as follows:

10. top

$ top
top - 00:15:40 up 21:56, 1 user, load average: 31.09, 29.87, 29.92
Tasks: 871 total, 1 running, 868 sleeping, 0 stopped, 2 zombie
%Cpu(s): 96.8 us, 0.4 sy, 0.0 ni, 2.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 25190241+total, 24921688 used, 22698073+free, 60448 buffers
KiB Swap: 0 total, 0 used, 0 free. 554208 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20248 root 20 0 0.227t 0.012t 18748 S 3090 5.2 29812:58 java
4213 root 20 0 2722544 64640 44232 S 23.5 0.0 233:35.37 mesos-slave
66128 titancl+ 20 0 24344 2332 1172 R 1.0 0.0 0:00.07 top
5235 root 20 0 38.227g 547004 49996 S 0.7 0.2 2:02.74 java
4299 root 20 0 20.015g 2.682g 16836 S 0.3 1.1 33:14.42 java
1 root 20 0 33620 2920 1496 S 0.0 0.0 0:03.82 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:05.35 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:06.94 kworker/u256:0
8 root 20 0 0 0 0 S 0.0 0.0 2:38.05 rcu_sched

The top command includes many of the metrics we checked earlier.
It can be handy to run it to see if anything looks wildly different from the earlier commands, which would indicate that load is variable.

A downside to top is that it is harder to see patterns over time, which may be more clear in tools like vmstat and pidstat, which provide rolling output.

Evidence of intermittent issues can also be lost if you don’t pause the output quick enough (Ctrl­S to pause, Ctrl­Q to continue), and the screen clears.

Initial value of var in session 2: 0.0

1. Stop the NBU services on the media server

Follow-on Analysis

There are many more commands and methodologies you can apply to drill deeper.
See Brendan’s Linux Performance Tools tutorial from Velocity 2015, which works through over 40 commands, covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing.

Tackling system reliability and performance problems at web scale is one of our passions.

If you would like to join us in tackling these kinds of challenges we are hiring!
Posted by Brendan Gregg

Incremented var in session 1

<NBU_install_path>\NetBackup\bin>bpdown -f -v

Value of var in session 1: 1.0

2. In the Volmgr\misc directory under the NBU installation path, rename or delete the file vmd.lock

Value of var in session 2: 0.0

3. Start the NBU services on the media server

For distributed TensorFlow, we first need to understand its basic principles. The following code builds the simplest possible TensorFlow cluster, to help us understand how it works.

<NBU_install_path>\NetBackup\bin>bpup -f -v

import tensorflow as tf

4. On the master server, use the nbrbutil tool to reset the media server

c = tf.constant("Hello, Distributed TensorFlow!")

<NBU_install_path>\NetBackup\bin\admincmd>nbrbutil -resetMediaServer xxxx

# Create a local TensorFlow cluster

where xxxx is the media server's hostname

server = tf.train.Server.create_local_server()

Two things to note before using the resetMediaServer option: first, run nbrbutil -dump to make sure no device allocations are stuck in EMM; second, make sure no backup jobs are running, since nbrbutil will stop the current jobs.

# Create a session on the cluster

5. Run the Configure Storage Devices wizard again

sess = tf.Session(server.target)

After selecting the media server, just click Next through the wizard and add the new path for the drive.

print(sess.run(c))

Finally, run the vmoprcmd command on the media server: the new drive shows up under TLD control, so the drive path has been added successfully. A manual backup test completed normally - done.

In the code above, tf.train.Server.create_local_server creates a single-machine TensorFlow cluster locally. We then create a session on that cluster, through which we can run a computation graph on the TensorFlow cluster. Although this is only a single-machine cluster, it basically reflects the workflow of a TensorFlow cluster.


A TensorFlow cluster executes the operations in a computation graph through a series of tasks; generally, different tasks run on different machines. Tasks in a TensorFlow cluster are also grouped into jobs. For example, when training a deep model, one machine running backpropagation is a task, and the set of all machines running backpropagation is a job. The simple example above is a cluster with a single task; when a TensorFlow cluster contains multiple tasks, we need tf.train.ClusterSpec to specify the machine for each task.


There are likewise two approaches to training deep models with distributed TensorFlow: in-graph replication and between-graph replication. With in-graph replication, all tasks use the variables of a single TensorFlow computation graph, and only the computation is distributed across servers. With between-graph replication, each compute server builds its own independent TensorFlow graph, but identical parameters across graphs are placed on the same parameter server in a fixed way. These are roughly the basic concepts of distributed TensorFlow; next we deepen this understanding through concrete cases and code.

Distributed TensorFlow

To share variables between processes, we need to connect the different execution engines together and enter distributed TensorFlow.

With distributed TensorFlow, each process runs a special execution engine: a TensorFlow server. Servers are linked together as parts of a cluster. (Each server in the cluster is also called a task.)

The first step is to define the size of the cluster. We start with the simplest cluster: two servers (two tasks), both on the same machine, one on port 2222 and one on port 2223.

tasks = ["localhost:2222","localhost:2223"]

各类义务都与「专业」(job)相关联,该专业是连锁任务的集纳。大家将这五个职务与八个名称为「local」的做事相关联。

jobs = {"local": tasks}

All of this together defines a cluster.

cluster = tf.train.ClusterSpec(jobs)

We can now launch the servers, specifying which server in the cluster definition each one is. Each server starts immediately, listening on the port given in the cluster spec.

# "This server corresponds to the the first task (task_index=0)

# of the tasks associated with the 'local' job."

server1 = tf.train.Server(cluster, job_name="local", task_index=0)

server2 = tf.train.Server(cluster, job_name="local", task_index=1)

With the servers linked in the same cluster, we can now experience the great power of distributed TensorFlow: any variable with the same name will be shared between all servers.

The simplest example is to run the same static computation graph on all servers, each graph with just a single variable:

tf.reset_default_graph()

var = tf.Variable(initial_value=0.0, name='var')

sess1 = tf.Session(server1.target)

sess2 = tf.Session(server2.target)

Now, a change made to the variable on one server will be mirrored on the second server.

sess1.run(tf.global_variables_initializer())

sess2.run(tf.global_variables_initializer())

print("Initial value of var in session 1:", sess1.run(var))

print("Initial value of var in session 2:", sess2.run(var))

sess1.run(var.assign_add(1.0))

print("Incremented var in session 1")

print("Value of var in session 1:", sess1.run(var))

print("Value of var in session 2:", sess2.run(var))

Initial value of var in session 1: 0.0
Initial value of var in session 2: 0.0
Incremented var in session 1
Value of var in session 1: 1.0
Value of var in session 2: 1.0

Note that because there is only one variable, and that variable is shared between both sessions, the second call to global_variables_initializer in session 2 is somewhat redundant.

Placement

A question we might now be asking: which server does the variable actually live on? And which server runs the operations?

As a rule of thumb, variables and operations are placed by default on the first task in the cluster.

def run_with_location_trace(sess, op):
    # From
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(op, options=run_options, run_metadata=run_metadata)
    for device in run_metadata.step_stats.dev_stats:
        print(device.device)
        for node in device.node_stats:
            print("  ", node.node_name)

For example, if we use the session connected to the first task to evaluate the variable var, then all the operations run on that task:

run_with_location_trace(sess1, var)

/job:local/replica:0/task:0/device:CPU:0

_SOURCE

var

run_with_location_trace(sess1, var.assign_add(1.0))

/job:local/replica:0/task:0/device:CPU:0

_SOURCE

AssignAdd_1/value

var

AssignAdd_1
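The device strings in the traces above follow a fixed "/key:value" structure. As a small illustrative sketch (our own helper, not part of the TensorFlow API), one can split such a string into its job, replica, task, and device components:

```python
def parse_device_string(device_str):
    """Split a TF device string such as
    "/job:local/replica:0/task:0/device:CPU:0" into its components.
    Illustrative helper only, not part of the TensorFlow API."""
    parts = {}
    for field in device_str.strip("/").split("/"):
        # partition splits on the FIRST colon only, so the device
        # component keeps its trailing index ("CPU:0").
        key, _, value = field.partition(":")
        parts[key] = value
    return parts

parts = parse_device_string("/job:local/replica:0/task:0/device:CPU:0")
print(parts["job"], parts["task"], parts["device"])  # local 0 CPU:0
```

This is just to make the trace output easier to read: the job and task fields are exactly the (job_name, task_index) coordinates used when constructing the servers.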

But if we try to evaluate var using the session connected to the second task, the graph nodes still run on the first task:

run_with_location_trace(sess2, var)

/job:local/replica:0/task:1/device:CPU:0

_SOURCE

/job:local/replica:0/task:0/device:CPU:0

_SOURCE

var

To pin a variable or operation to a particular task, we can use tf.device:

with tf.device("/job:local/task:0"):
    var1 = tf.Variable(0.0, name='var1')
with tf.device("/job:local/task:1"):
    var2 = tf.Variable(0.0, name='var2')

# (This will initialize both variables)
sess1.run(tf.global_variables_initializer())

Now var1 runs on the first task, as before.

run_with_location_trace(sess1, var1)

/job:local/replica:0/task:0/device:CPU:0

_SOURCE

var1

But var2 runs on the second task. Even if we try to evaluate it using the session connected to the first task, it still runs on the second task:

run_with_location_trace(sess1, var2)

/job:local/replica:0/task:0/device:CPU:0

_SOURCE

/job:local/replica:0/task:1/device:CPU:0

_SOURCE

var2

The same holds when using session 2:

run_with_location_trace(sess2, var2)

/job:local/replica:0/task:1/device:CPU:0

_SOURCE

var2

run_with_location_trace(sess2, var1)

/job:local/replica:0/task:1/device:CPU:0

_SOURCE

/job:local/replica:0/task:0/device:CPU:0

_SOURCE

var1

Computation graphs

There are a few points to note about how distributed TensorFlow handles graphs.

Who builds the graph?

First, although variable values are shared across the whole cluster, the graph itself is not automatically shared.

Let's create a new cluster with two servers, and set up the first server with an explicitly created graph.

cluster = tf.train.ClusterSpec({"local": ["localhost:2224","localhost:2225"]})

server1 = tf.train.Server(cluster, job_name="local", task_index=0)

server2 = tf.train.Server(cluster, job_name="local", task_index=1)

graph1 = tf.Graph()
with graph1.as_default():
    var1 = tf.Variable(0.0, name='var')
sess1 = tf.Session(target=server1.target, graph=graph1)
print(graph1.get_operations())

[<tf.Operation 'var/initial_value' type=Const>, <tf.Operation 'var' type=VariableV2>, <tf.Operation 'var/Assign' type=Assign>, <tf.Operation 'var/read' type=Identity>]

If we create a session connected to the second server, note that the graph does not automatically get mirrored:

graph2 = tf.Graph()

sess2 = tf.Session(target=server2.target, graph=graph2)

print(graph2.get_operations())

————————————————————————————

[]

To access the shared variable, we must manually add a variable with the same name to the second graph.

with graph2.as_default():
    var2 = tf.Variable(0.0, name='var')

Only then can we access it.

sess1.run(var1.assign(1.0))

sess2.run(var2)

————————————————————————————

1.0

The key point: each server is responsible for building its own graph.

Do the graphs on all servers have to be the same?

So far, all our examples have run the same graph on both servers. This is known as in-graph replication.

For example, suppose we have a cluster with three servers. Server 1 holds shared parameters, while servers 2 and 3 are worker nodes, each with local variables. With in-graph replication, each server's graph looks like:

(Figure: in-graph replication — every server holds a copy of the entire graph)

The problem with in-graph replication is that every server must hold a copy of the entire graph, including subgraphs that may only be relevant to other servers. This can make the graphs very large.

The alternative is between-graph replication. Here, each server runs a graph containing only the shared parameters, plus whatever variables and operations are relevant to that individual server.

(Figure: between-graph replication — each server holds only the shared parameters and its own subgraph)

This approach keeps the graphs smaller, so between-graph replication is recommended.

Practical details

Before we get to a complete example, there are a few practical details worth discussing.

What happens if we try to run something on the cluster before all the servers have connected?

Let's create another two-task cluster.

cluster = tf.train.ClusterSpec({
    "local": ["localhost:2226", "localhost:2227"]
})

This time, let's launch each server in a separate process. (That lets us kill the servers whenever we like, so we can restart them for later experiments. Besides killing the process that runs it, there is currently no other way to shut down a server.)

from multiprocessing import Process
from time import sleep

def s1():
    server1 = tf.train.Server(cluster,
                              job_name="local",
                              task_index=0)
    sess1 = tf.Session(server1.target)
    print("server 1: running no-op...")
    sess1.run(tf.no_op())
    print("server 1: no-op run!")
    server1.join()  # Block

def s2():
    for i in range(3):
        print("server 2: %d seconds left before connecting..."
              % (3 - i))
        sleep(1.0)
    server2 = tf.train.Server(cluster,
                              job_name="local",
                              task_index=1)
    print("server 2: connected!")
    server2.join()  # Block

# daemon=True so that these processes will definitely be killed
# when the parent process restarts
p1 = Process(target=s1, daemon=True)
p2 = Process(target=s2, daemon=True)

Server 1 joins the cluster immediately, but server 2 waits a while before connecting. The results:

p1.start()

p2.start()

server 2: 3 seconds left before connecting...
server 1: running no-op...
server 2: 2 seconds left before connecting...
server 2: 1 seconds left before connecting...
server 2: connected!
server 1: no-op run!

As you can see, each server that tries to run an operation on the cluster blocks until all the servers have joined.

p1.terminate()

p2.terminate()

What happens when a server leaves the cluster?

We set up a cluster with two servers. Server 1 just repeatedly tries to run a no-op placed on server 1. Server 2 will die after two seconds.

def s1():
    server1 = tf.train.Server(cluster,
                              job_name="local",
                              task_index=0)
    with tf.device("/job:local/task:0"):
        no_op = tf.no_op()
    sess1 = tf.Session(server1.target)
    for _ in range(6):
        print("Server 1: about to run no-op...", end="")
        sess1.run(no_op)
        print("success!")
        sleep(1.0)

def s2():
    server2 = tf.train.Server(cluster,
                              job_name="local",
                              task_index=1)
    sleep(2.0)
    print("Server 2 dieing...")

p1 = Process(target=s1, daemon=True)
p2 = Process(target=s2, daemon=True)
p1.start()
p2.start()

————————————————————————————————

Server 1: about to run no-op...success!
Server 1: about to run no-op...success!
Server 2 dieing...
Server 1: about to run no-op...success!
Server 1: about to run no-op...success!
Server 1: about to run no-op...success!
Server 1: about to run no-op...success!

In the short term, as long as the operation we try to run is not on the server that left, there seems to be no problem. (I haven't tested what happens in the long term.)

What if the operation is on the server that left...

def s1():
    server1 = tf.train.Server(cluster,
                              job_name="local",
                              task_index=0)
    # This time, we place the no-op on server 2,
    # which is going to leave
    with tf.device("/job:local/task:1"):
        no_op = tf.no_op()
    sess1 = tf.Session(server1.target)
    for _ in range(5):
        print("Server 1: about to run no-op...", end="")
        sess1.run(no_op)
        print("success!")
        sleep(1.0)

p1 = Process(target=s1, daemon=True)
p2 = Process(target=s2, daemon=True)
p1.start()
p2.start()

——————————————————————————————————

Server 1: about to run no-op...success!
Server 1: about to run no-op...success!
Server 2 dieing...

After this, the attempt to run the operation no longer succeeds, so we terminate both processes.

p1.terminate()

p2.terminate()

What happens if the server joins the cluster again?

p1 = Process(target=s1, daemon=True)
p2 = Process(target=s2, daemon=True)
p1.start()
p2.start()
sleep(3.0)
# At this point, server 1 is blocked, and server 2 is dead.
print("Restarting server 2...")
p2 = Process(target=s2, daemon=True)
p2.start()

————————————————————————————

Server 1: about to run no-op...success!
Server 1: about to run no-op...success!
Server 2 dieing...
Restarting server 2...
Process Process-7:
Traceback (most recent call last):
  File "/Users/matthew/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/Users/matthew/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/Users/matthew/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.AbortedError: Graph handle is not found: 0000000000000001
Server 1: about to run no-op...Server 2 dieing...

The system reports a Graph handle is not found error.

So distributed TensorFlow does not automatically recover from server failures. (If you are interested in fault tolerance, see …)

Who is responsible for initializing the shared variables?

One approach is to have all workers run tf.global_variables_initializer().

But if we want to keep our code clean and have only one server perform the initialization, another server might run into trouble if it tries to use the variables before they are initialized. One solution is to have the other workers wait, using tf.report_uninitialized_variables, until initialization has happened.
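The waiting logic in the code below is a generic poll-until-ready loop. As a TensorFlow-free sketch of the same pattern (the helper name and timeout handling here are our own additions, not from the original article):

```python
import time

def wait_until(is_ready, poll_interval=0.01, timeout=5.0):
    """Poll is_ready() until it returns a truthy value, the way the
    workers below poll tf.report_uninitialized_variables() until the
    returned list is empty. Illustrative helper, not TensorFlow API."""
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError("initialization never completed")
        time.sleep(poll_interval)

# Simulate a variable that becomes "initialized" on the third poll.
state = {"polls": 0}
def fake_ready():
    state["polls"] += 1
    return state["polls"] >= 3

wait_until(fake_ready)
print(state["polls"])  # 3
```

The real code below does the same thing, with sess.run(tf.report_uninitialized_variables()) playing the role of is_ready and a one-second sleep between polls.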

def s1():
    server1 = tf.train.Server(cluster,
                              job_name="local",
                              task_index=0)
    var = tf.Variable(0.0, name='var')
    sess1 = tf.Session(server1.target)
    print("Server 1: waiting for connection...")
    sess1.run(tf.report_uninitialized_variables())
    while len(sess1.run(tf.report_uninitialized_variables())) > 0:
        print("Server 1: waiting for initialization...")
        sleep(1.0)
    print("Server 1: variables initialized!")

def s2():
    server2 = tf.train.Server(cluster,
                              job_name="local",
                              task_index=1)
    var = tf.Variable(0.0, name='var')
    sess2 = tf.Session(server2.target)
    for i in range(3):
        print("Server 2: waiting %d seconds before initializing..."
              % (3 - i))
        sleep(1.0)
    sess2.run(tf.global_variables_initializer())

p1 = Process(target=s1, daemon=True)
p2 = Process(target=s2, daemon=True)
p1.start()
p2.start()

—————————————————————————————————

Server 1: waiting for connection...
Server 2: waiting 3 seconds before initializing...
Server 1: waiting for initialization...
Server 2: waiting 2 seconds before initializing...
Server 1: waiting for initialization...
Server 2: waiting 1 seconds before initializing...
Server 1: waiting for initialization...
Server 1: variables initialized!

p1.terminate()

p2.terminate()

Example

Let's put everything we've learned together into a final example that uses multiple processes.

We will create:

A parameter server that stores a single variable, var.

Two worker tasks; each worker will increment the variable var several times.

We will have the parameter server print the value of var a few times, so we can watch it change.

import tensorflow as tf
from multiprocessing import Process
from time import sleep

cluster = tf.train.ClusterSpec({
    "worker": [
        "localhost:3333",
        "localhost:3334",
    ],
    "ps": [
        "localhost:3335"
    ]
})

def parameter_server():
    with tf.device("/job:ps/task:0"):
        var = tf.Variable(0.0, name='var')
    server = tf.train.Server(cluster,
                             job_name="ps",
                             task_index=0)
    sess = tf.Session(target=server.target)
    print("Parameter server: waiting for cluster connection...")
    sess.run(tf.report_uninitialized_variables())
    print("Parameter server: cluster ready!")
    print("Parameter server: initializing variables...")
    sess.run(tf.global_variables_initializer())
    print("Parameter server: variables initialized")
    for i in range(5):
        val = sess.run(var)
        print("Parameter server: var has value %.1f" % val)
        sleep(1.0)
    print("Parameter server: blocking...")
    server.join()

def worker(worker_n):
    with tf.device("/job:ps/task:0"):
        var = tf.Variable(0.0, name='var')
    server = tf.train.Server(cluster,
                             job_name="worker",
                             task_index=worker_n)
    sess = tf.Session(target=server.target)
    print("Worker %d: waiting for cluster connection..." % worker_n)
    sess.run(tf.report_uninitialized_variables())
    print("Worker %d: cluster ready!" % worker_n)
    while sess.run(tf.report_uninitialized_variables()):
        print("Worker %d: waiting for variable initialization..." % worker_n)
        sleep(1.0)
    print("Worker %d: variables initialized" % worker_n)
    for i in range(5):
        print("Worker %d: incrementing var" % worker_n)
        sess.run(var.assign_add(1.0))
        sleep(1.0)
    print("Worker %d: blocking..." % worker_n)
    server.join()

ps_proc = Process(target=parameter_server, daemon=True)
w1_proc = Process(target=worker, args=(0, ), daemon=True)
w2_proc = Process(target=worker, args=(1, ), daemon=True)

ps_proc.start()

————————————————————————————

Parameter server: waiting for cluster connection...
Parameter server: cluster ready!
Parameter server: initializing variables...
Parameter server: variables initialized
Parameter server: var has value 0.0
Parameter server: var has value 2.0
Parameter server: var has value 4.0
Parameter server: var has value 5.0
Parameter server: var has value 7.0
Parameter server: blocking...

w1_proc.start()

————————————————————————————————

Worker 0: waiting for cluster connection...
Worker 0: cluster ready!
Worker 0: waiting for variable initialization...
Worker 0: variables initialized
Worker 0: incrementing var
Worker 0: incrementing var
Worker 0: incrementing var
Worker 0: incrementing var
Worker 0: incrementing var
Worker 0: blocking...

w2_proc.start()

———————————————————————————————

Worker 1: waiting for cluster connection...
Worker 1: cluster ready!
Worker 1: waiting for variable initialization...
Worker 1: variables initialized
Worker 1: incrementing var
Worker 1: incrementing var
Worker 1: incrementing var
Worker 1: incrementing var
Worker 1: incrementing var
Worker 1: blocking...

for proc in [w1_proc, w2_proc, ps_proc]:
    proc.terminate()

Summary

In this article, we learned:

How to integrate multiple TensorFlow execution engines (running in different processes or on different machines) into a single cluster, so that variables can be shared.

How to pin variables or operations to a specific server.

In-graph replication versus between-graph replication.

What happens if you run operations on the cluster before all the servers have connected, or after a server has left the cluster.

How to wait for a variable to be initialized by another task in the cluster.
