11-20-2011, 12:10 AM
"Kevin O'Gorman"
 
Bash script clobbers something vital (lucid)

On Sat, Nov 19, 2011 at 6:36 AM, J <dreadpiratejeff@gmail.com> wrote:
> On Sat, Nov 19, 2011 at 03:47, Colin Law <clanlaw@googlemail.com> wrote:
>
>> Have a look in syslog and see what are the first errors that appear
>> there. No need to get it to fail again for this, you can go back to
>> the previous log, assuming you know when it happened. Also if you run
>> the disc utilities do you see any errors noted in the SMART data?

Yesterday's syslog is 11 MB for this and other reasons. I can't be
sure which entry goes with what. I'm trying again today with shorter
runs, and I'll keep trying until I get something to report.
>>
>> Are you doing anything else when it fails (plugging/unplugging
>> additional usb devices for example)?

Nothing but some activity in my web browser (for which I sometimes get
complaints from tar). Nothing gets mounted/unmounted/plugged
in/unplugged.

++ kevin

--
Kevin O'Gorman, PhD

--
ubuntu-users mailing list
ubuntu-users@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users
 
11-20-2011, 12:20 AM
"compdoc"
 

When building systems, copying a few very large files used to be an excellent way to test a system. If the hardware, certain BIOS settings, the RAM, or the drivers were wrong or bad, the system would hang during the transfer or the transfer would be interrupted.

A large transfer should flow smoothly and with no pauses - from drive to drive, or from system to network share.
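
A minimal sketch of that kind of transfer test, assuming a POSIX shell with cp, sync and cmp available. The function name copy_and_verify is made up for illustration. Note that the comparison may be served from the page cache rather than from the disk, so a serious test would use files larger than RAM.

```shell
#!/bin/sh
# Sketch only: copy a file, then confirm the copy is byte-identical.
# copy_and_verify SRC DST -> exit 0 if the copy matches, non-zero otherwise.
# (copy_and_verify is a hypothetical name, not part of any tool.)
copy_and_verify() {
    src=$1
    dst=$2
    # cp reports I/O errors it can see during the transfer.
    cp -- "$src" "$dst" || return 1
    # Ask the kernel to flush dirty pages before comparing.
    sync
    # cmp -s is silent; its exit status says whether the files differ.
    cmp -s -- "$src" "$dst"
}
```

It would be called as copy_and_verify /path/to/bigfile /mnt/target/bigfile, with both paths being placeholders.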




 
11-20-2011, 05:47 AM
"Kevin O'Gorman"
 

On Sat, Nov 19, 2011 at 3:58 PM, Karl Auer <kauer@biplane.com.au> wrote:
> On Sat, 2011-11-19 at 14:21 -0800, Kevin O'Gorman wrote:
>> > Post the script.
>>
>> Attached. It's in three parts
>
> At first blush, I'd say you need to check the inputs more carefully -
> when you are playing around with fdisk and dd, it's essential that the
> parameters are correct. So in bkfuncts.sh, I'd be wrapping some serious
> error checking around those exported variables, especially drive and
> loc. It may not have anything to do with the current problem, but it
> will probably save you somewhere down the track.

Drive is checked against active mountpoints. If it's not mounted, the
script dies. I don't see how to improve on that.
Loc comes directly from hostname(1). It's used to make backup
filenames and to locate bkdropkick.sh in the current case, and other
files on other hosts.
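
For illustration, checks along these lines could look like the sketch below. The function names is_mounted and valid_loc are hypothetical and are not taken from bkfuncts.sh, which is not shown in this thread. The sketch assumes Linux, where the second field of each /proc/mounts line is the mount point (paths containing spaces are escaped there and would need extra handling).

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# is_mounted DIR [MOUNTS_FILE] -> exit 0 if DIR is listed as a mount point.
is_mounted() {
    dir=$1
    mounts=${2:-/proc/mounts}
    # Field 2 of each line is the mount point; exit 0 only on an exact match.
    awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts"
}

# valid_loc NAME -> exit 0 if NAME is non-empty and contains only
# characters that are safe to embed in a backup filename.
valid_loc() {
    case $1 in
        ''|*[!A-Za-z0-9._-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Possible use in a script (drive and loc as described above):
#   is_mounted "$drive" || { echo "$drive is not mounted" >&2; exit 1; }
#   valid_loc "$loc"    || { echo "bad loc: $loc" >&2; exit 1; }
```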

>> everything on the local machine. The problem happens in the middle
>> of this script.
>
> Locating exactly where a bug happens is pretty much the first step to
> fixing it. If the symptom is that the drive is no longer readable, then
> set up a telltale file and check at likely points in your scripts that
> it still exists. If you suddenly can't find it or read it, the failure
> has happened between that point and the last point where you could see
> it. That narrows down the debug space.
>
> If you can reduce the magnitude of the backup while you debug, it will
> speed your debugging - can you set up a virtual with small disks and
> run all this stuff on the virtual? If it doesn't happen on the virtual,
> that's interesting information too.

I already know how to debug, actually, though my original posting may
not reflect that (3 am if I recall). My current hypothesis is that it
only fails on large workloads, partly because the eventual failure is
of a kind that would normally have been seen and fixed long ago. I'm
currently trying to build up the size from a shortened version that
does not fail. The "bksumit" adds a lot of time, but is my prime
suspect for where the bug is. (My current theory is that "chattr +i
*" causes the problem when there are dirty pages still in cache.)
Once I have a small-sized failure, I'll try the obvious fixes.
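
The telltale-file idea quoted above might be sketched as follows. The helper names telltale_init and telltale_check and the file name .telltale are made up for illustration. A small file can be served from the page cache even after the device has gone away, so this is a rough checkpoint rather than a reliable health test.

```shell
#!/bin/sh
# Sketch only: helper names and the .telltale file name are hypothetical.

# telltale_init DIR -> create a marker file on the backup volume.
telltale_init() {
    TELLTALE=$1/.telltale
    date > "$TELLTALE"
}

# telltale_check LABEL -> exit 0 if the marker can still be read;
# otherwise report the checkpoint label on stderr and return 1.
telltale_check() {
    if cat "$TELLTALE" > /dev/null 2>&1; then
        return 0
    fi
    echo "telltale unreadable at checkpoint: $1" >&2
    return 1
}

# Possible use between backup steps:
#   telltale_init "$drive"
#   telltale_check "after tar of home" || exit 1
```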

>
>> If I comment out all of the commands
>> that worked, running the shell starts computing the md5sum's okay.
>
> If you know what commands worked, presumably you know which command
> didn't... do you know which command failed?
>
I think so, but it needs checking. md5sum and everything after it
that referenced the backup drive. I do know a number of commands that
did not fail, since they left files on the backup volume. Simply
retaining the last such does not cause the error, however, so I must
sneak up on this a little. The runs are long enough that I can only
try a few per day. How nice it is that I'm recently retired...

--
Kevin O'Gorman, PhD

 
11-20-2011, 06:41 AM
Karl Auer
 

On Sat, 2011-11-19 at 22:47 -0800, Kevin O'Gorman wrote:
> I already know how to debug, actually

Oh, good. Leave you to it, then.

Regards, K.

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~
Karl Auer (kauer@biplane.com.au) +61-2-64957160 (h)
http://www.biplane.com.au/kauer/ +61-428-957160 (mob)

GPG fingerprint: DA41 51B1 1481 16E1 F7E2 B2E9 3007 14ED 5736 F687
Old fingerprint: B386 7819 B227 2961 8301 C5A9 2EBC 754B CD97 0156
 
11-20-2011, 04:32 PM
"Kevin O'Gorman"
 

On Sat, Nov 19, 2011 at 11:41 PM, Karl Auer <kauer@biplane.com.au> wrote:
> On Sat, 2011-11-19 at 22:47 -0800, Kevin O'Gorman wrote:
>> I already know how to debug, actually
>
> Oh, good. Leave you to it, then.

Thanks. I could use some help with interpretation of results, however.

I left a shorter version running last night around midnight and it
exhibits the problem, albeit not quite in the way I expected. It ran
for 2 hours and then quietly died (no error messages). The
backup drive is inaccessible, as reported earlier.

There was an earlier suggestion that I should look in syslog. There
is nothing in syslog for the period the backup was running (except for
the usual crontab reports). Then at 7:08 AM, there are things
relating to my checking results:

Nov 20 06:47:01 localhost CRON[5514]: (root) CMD (test -x
/usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly ))
Nov 20 06:47:50 localhost ntpd[2273]: kernel time sync status change 2001
Nov 20 07:08:46 localhost kernel: [122104.602699] usb 2-1.1.3: USB
disconnect, address 7
Nov 20 07:08:46 localhost kernel: [122104.975064] usb 2-1.1.3: new
high speed USB device using ehci_hcd and address 10
Nov 20 07:08:46 localhost kernel: [122105.084686] usb 2-1.1.3:
configuration #1 chosen from 1 choice
Nov 20 07:08:46 localhost kernel: [122105.085894] scsi6 : SCSI
emulation for USB Mass Storage devices
Nov 20 07:08:46 localhost kernel: [122105.086063] usb-storage: device
found at 10
Nov 20 07:08:46 localhost kernel: [122105.086066] usb-storage: waiting
for device to settle before scanning
Nov 20 07:08:51 localhost kernel: [122110.081767] usb-storage: device
scan complete
Nov 20 07:08:51 localhost kernel: [122110.082578] scsi 6:0:0:0:
Direct-Access ST2000DL 003-9VT166 PQ: 0 ANSI: 2 CCS
Nov 20 07:08:51 localhost kernel: [122110.083753] sd 6:0:0:0: Attached
scsi generic sg3 type 0
Nov 20 07:08:51 localhost kernel: [122110.091233] sd 6:0:0:0: [sdd]
3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Nov 20 07:08:51 localhost kernel: [122110.092948] sd 6:0:0:0: [sdd]
Write Protect is off
Nov 20 07:08:51 localhost kernel: [122110.092955] sd 6:0:0:0: [sdd]
Mode Sense: 00 38 00 00
Nov 20 07:08:51 localhost kernel: [122110.092960] sd 6:0:0:0: [sdd]
Assuming drive cache: write through
Nov 20 07:08:51 localhost kernel: [122110.095139] sd 6:0:0:0: [sdd]
Assuming drive cache: write through
Nov 20 07:08:51 localhost kernel: [122110.095147] sdd: sdd1
Nov 20 07:08:51 localhost kernel: [122110.104675] sdd: p1 size
4294953054 exceeds device capacity, limited to end of disk
Nov 20 07:08:51 localhost kernel: [122110.106989] sd 6:0:0:0: [sdd]
Assuming drive cache: write through
Nov 20 07:08:51 localhost kernel: [122110.106993] sd 6:0:0:0: [sdd]
Attached SCSI disk
Nov 20 07:09:01 localhost CRON[5547]: (root) CMD ( [ -x
/usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find
/var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime)
-print0 | xargs -n 200 -r -0 rm)
Nov 20 07:16:09 localhost kernel: [122547.231676] EXT4-fs error
(device sdc1): __ext4_get_inode_loc: unable to read inode block -
inode=9308187, block=148897953
Nov 20 07:16:09 localhost kernel: [122547.231702] EXT4-fs error
(device sdc1) in ext4_reserve_inode_write: IO failure
Nov 20 07:16:09 localhost kernel: [122547.231708] EXT4-fs (sdc1):
previous I/O error to superblock detected
Nov 20 07:16:09 localhost kernel: [122547.276335] EXT4-fs error
(device sdc1): ext4_find_entry: reading directory #9308187 offset 0
Nov 20 07:16:09 localhost kernel: [122547.276347] EXT4-fs (sdc1):
previous I/O error to superblock detected
Nov 20 07:16:14 localhost kernel: [122552.608123] end_request: I/O
error, dev sdc, sector 1950781968
Nov 20 07:16:14 localhost kernel: [122552.608135] Aborting journal on
device sdc1-8.
Nov 20 07:16:14 localhost kernel: [122552.608154] JBD2: I/O error
detected when updating journal superblock for sdc1-8.
Nov 20 07:17:01 localhost CRON[5563]: (root) CMD ( cd / && run-parts
--report /etc/cron.hourly)
Nov 20 07:30:01 localhost CRON[5686]: (root) CMD (start -q anacron ||

The output on the screen is also remarkable. It ends thus:

*** Sat Nov 19 23:49:32 PST 2011 Back up root (no home)
*** Sun Nov 20 01:45:40 PST 2011 Back up subdirectory home (.tgz)
*** Sun Nov 20 01:58:11 PST 2011 Back up subdirectory usr/local (.tgz)
*** Sun Nov 20 01:58:13 PST 2011 Show usage after the backup
*** Sun Nov 20 02:44:47 PST 2011 "./ball.sh 2te" finished on dropkick

real 175m22.010s
user 118m6.667s
sys 7m17.479s

The 01:58:13 entry is from function bkdf. The one after it is from
bkfinish, almost an hour later. Presumably, bksumit and bkscripts ran
(silently, which is normal) in between. This is what I would expect
from a correct run. However, as you can see from the above syslog,
there was something seriously wrong with access to the backup drive
when I came back to check on it this morning. As I write this, I'm
leaving the backup drive in this state so that I can figure out how to
have my script detect this state.

I'm presuming the drive will be accessible after a reboot.

Anybody know how to interpret the syslog?
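
One way to pick the relevant lines out of a large syslog is sketched below. The function name kernel_io_errors is hypothetical, the default log path /var/log/syslog is an assumption, and the keyword list is simply taken from the messages quoted above. It assumes the log file holds one entry per line (the excerpt above is wrapped by the mail client). In that excerpt the errors name sdc and sdc1, while the device that re-attaches at 07:08:51 is sdd.

```shell
#!/bin/sh
# Sketch only: kernel_io_errors is a hypothetical helper.
# kernel_io_errors DEV [LOGFILE] -> print kernel log lines that contain an
# error keyword and mention DEV; exit 0 if at least one line was found.
kernel_io_errors() {
    dev=$1
    log=${2:-/var/log/syslog}
    # First keep kernel lines with one of the error phrases seen above,
    # then keep only those that mention the device name.
    grep -E 'kernel:.*(I/O error|EXT4-fs error|Aborting journal)' "$log" |
        grep -F "$dev"
}

# Possible use after a backup run:
#   kernel_io_errors sdc && echo "backup drive reported I/O errors" >&2
```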

--
Kevin O'Gorman, PhD

 
11-22-2011, 02:42 AM
"Kevin O'Gorman"
 

Synopsis: not a bug, but a hardware problem.

On Sun, Nov 20, 2011 at 9:32 AM, Kevin O'Gorman <kogorman@gmail.com> wrote:
> On Sat, Nov 19, 2011 at 11:41 PM, Karl Auer <kauer@biplane.com.au> wrote:
>> On Sat, 2011-11-19 at 22:47 -0800, Kevin O'Gorman wrote:
>>> I already know how to debug, actually
>>
>> Oh, good. Leave you to it, then.

Sorry for the excitement. The errors were not consistent, so I tried
everything and found that the problem disappeared when I changed USB
ports and cables.

--
Kevin O'Gorman, PhD

 

All times are GMT.