Linux Archive


seth vidal 03-20-2012 05:44 PM

builders of the future!!!!!
 
The discussion on devel list about ARM and my work last week on
reinstalling builders quickly and commonly has raised a number of
issues with how we manage our builders and how we should manage them in
the future.

It is apparent that if we add arm builders they will be lots of
physical systems (probably in a very small space), but physical
nonetheless. So we need a sensible way to manage and reinstall these
hosts commonly and quickly.

Additionally, we need to consider what the introduction of a largish
number of arm builders (and other arm infrastructure) would do to our
existing puppet setup: specifically, overloading it pretty badly and
making it not very manageable.

I'm making certain assumptions here and I'd like to be clear about what
those are:

1. the builders need to be kept pristine
2. that currently our builders are not freshly installed frequently
enough.
3. that the builders are relatively static in their
configuration and most changes are done with pkg additions
4. that builder setups require at least two manual-ish steps of a koji
admin who can disable/enable/register the builder with the kojihub.
5. that the builders are fairly different networking and setup-wise to
the rest of our systems.

So I am proposing that we consider the following as a general process
for maintaining our builders:

1. disable the builder in koji
2. make sure all jobs are finished
3. add installer entries into grub (or run the undefine, reinstall
process if the builder is virt-based)
4. reinstall the system
5. monitor for ssh to return
6. connect in and force our post-install configuration: identification,
network, mount-point setup, ssl certs/keys for koji, etc
7. reboot
8. re-enable host in koji
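
As a rough illustration of the process above, here is a minimal Python sketch. It is not the tooling used on the Fedora builders: the step names in `steps` and the `reinstall_builder` helper are hypothetical hooks that a site would have to supply, and `wait_for_ssh` only checks that a TCP connection to port 22 succeeds.

```python
import socket
import time


def wait_for_ssh(host, port=22, timeout=1800.0, interval=10.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only shows that sshd is listening again.
            with socket.create_connection((host, port), timeout=5.0):
                return True
        except OSError:
            time.sleep(interval)
    return False


def reinstall_builder(host, steps, wait=wait_for_ssh):
    """Run the proposed steps in order for one builder.

    `steps` maps hypothetical step names to site-specific callables.
    """
    steps["disable_in_koji"](host)        # step 1
    steps["wait_for_jobs"](host)          # step 2
    steps["start_reinstall"](host)        # steps 3-4: grub entry or virt undefine, then reinstall
    if not wait(host):                    # step 5
        raise RuntimeError(f"{host}: ssh never came back, builder left disabled")
    steps["post_install_config"](host)    # step 6
    steps["reboot"](host)                 # step 7
    steps["enable_in_koji"](host)         # step 8
```

If ssh never returns, the sketch stops before step 8, so a half-installed builder stays disabled in koji rather than picking up jobs.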

We would do this with frequency and regularity, perhaps even having
some percentage of our builders doing this at all times. I.e., 1/10th of
the boxes reinstalling at any given moment, so that after 10 reinstall
time frames all of them have been reinstalled.
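
To make the 1/10th idea concrete, here is a small Python sketch that splits a builder list into ten batches, one per reinstall window. The function name `rolling_batches` and the host names are made up for illustration; nothing here comes from the actual builder tooling.

```python
def rolling_batches(hosts, parts=10):
    """Split hosts into `parts` batches of near-equal size.

    Reinstalling one batch per time frame covers every host
    after `parts` time frames.
    """
    ordered = sorted(hosts)
    # Striding by `parts` puts each host in exactly one batch.
    return [ordered[i::parts] for i in range(parts)]


if __name__ == "__main__":
    builders = [f"builder{i:02d}" for i in range(20)]
    for window, batch in enumerate(rolling_batches(builders), start=1):
        print(window, batch)
```

With 20 builders this gives two hosts per window, so at most a tenth of the build capacity is out for reinstall at any moment.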

Additionally, this would mean these systems would NOT have a puppet
management piece at all. Package updates would still be handled
by pushes as we do now, if things were security critical, but barring
the need for significant changes we could rely on the boxes simply being
refreshed frequently enough that it wouldn't need to be pushed.

What do folks think about this idea? It would dramatically reduce the
node entries in our puppet config, and it would drop the number of hosts
connecting to puppet, too. It will mean more systems being reinstalled,
and more often. It will also require some work to automate the steps I
mention above. I think I can achieve that without too much
difficulty, actually. I think, in general, it will increase our ability
to scale up to more and more builders.


I'd like input, constructive, please.

Thanks,
-sv
_______________________________________________
infrastructure mailing list
infrastructure@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/infrastructure

Dennis Gilmore 03-21-2012 01:38 AM

builders of the future!!!!!
 
On Tue, 20 Mar 2012 14:44:07 -0400
seth vidal <skvidal@fedoraproject.org> wrote:
> The discussion on devel list about ARM and my work last week on
> reinstalling builders quickly and commonly has raised a number of
> issues with how we manage our builders and how we should manage them
> in the future.
>
> It is apparent that if we add arm builders they will be lots of
> physical systems (probably in a very small space) but physical,
> none-the-less. So we need a sensible way to manage and reinstall these
> hosts commonly and quickly.

Today there is no way to do an anaconda install on any arm system,
though hopefully we will have that for deployment.

> Additionally, we need to consider what the introduction of a largish
> number of arm builders (and other arm infrastructure) would do to our
> existing puppet setup. Specifically overloading it pretty badly and
> making it not-very-manageable.

Probably we would be adding 100-300 systems. Not only do we need to
consider overloading of puppet, but also logging and monitoring. I
guess it's more a question of how we scale our infrastructure from, at
a guess, ~100 nodes today to 3 to 4 times that.

> I'm making certain assumptions here and I'd like to be clear about
> what those are:
>
> 1. the builders need to be kept pristine
> 2. that currently our builders are not freshly installed frequently
> enough.
> 3. that the builders are relatively static in their
> configuration and most changes are done with pkg additions
> 4. that builder setups require at least two manual-ish steps of a koji
> admin who can disable/enable/register the builder with the kojihub.
> 5. that the builders are fairly different networking and setup-wise to
> the rest of our systems.
>
> So I am proposing that we consider the following as a general process
> for maintaining our builders:
>
> 1. disable the builder in koji
> 2. make sure all jobs are finished
> 3. add installer entries into grub (or run the undefine, reinstall
> process if the builder is virt-based)
> 4. reinstall the system
> 5. monitor for ssh to return
> 6. connect in and force our post-install configuration:
> identification, network, mount-point setup, ssl certs/keys for koji,
> etc
> 7. reboot
> 8. re-enable host in koji
>
> We would do this with frequency and regularity. Perhaps even having
> some percentage of our builders doing this at all times. Ie: 1/10th of
> the boxes reinstalling at any given moment so in a certain time
> frame*10 all of them are reinstalled.

Honestly, we could do this instead of the monthly updates: just rebuild
them.

>
> Additionally, this would mean these systems would NOT have a puppet
> management piece at all. Package updates would still be handled
> by pushes as we do now, if things were security critical, but barring
> the need for significant changes we could rely on the boxes simply
> being refreshed frequently enough that it wouldn't need to be pushed.

I'm OK with that; I'm pretty sure fas will scale to the extra boxes. Do
we drop monitoring of the builders? What about collectd etc.?

> What do folks think about this idea? It would dramatically reduce the
> node entries in our puppet config, it would drop the number of hosts
> connecting to puppet, too. It will mean more systems being reinstalled
> and more often. It will also require some work to make the steps I
> mention above be automated. I think I can achieve that without too
> much difficulty, actually. I think, in general, it will increase our
> ability to scale up to more and more builders.

The main issue is that today we are not 100% sure how we will install
arm boxes. How do we deal with all the non-puppet-related systems? We
also need to look into how we can better scale koji itself: when we go
from 20 to 200+ builders we need to make sure that the load doesn't
cause koji to fall over.


All the arm boxes will have management consoles, but today I'm not 100%
sure how access to them would work. We would also need to deploy Fedora
for any arm-based systems.

Another thing we need to reconsider is networking. Today the storage
network and the builder networks are /24s, so we could use 253 nodes on
each. I suspect we will go over that on the build network. We could leave
the arm builders off the storage network; it is really only needed for
createrepo. But we may need to look at expanding kojipkgs to more nodes,
or increasing its network throughput with multiple bonded gigabit network
ports. Think of a mass rebuild with 100 or 200 buildroots initialising at
once: it will stress our resources on all levels. But the flexibility of
so many nodes could allow us to deploy solid solutions to scale, and show
that Fedora is still the leader in open infrastructure and sets industry
best practices.
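
The /24 limit can be checked with a little arithmetic. The Python sketch below assumes IPv4 and that, besides the network and broadcast addresses, one address is reserved (for example for a gateway), which is one way to arrive at the 253 figure above. The function names are illustrative only.

```python
def usable_nodes(prefixlen, reserved=1):
    """Addresses left for nodes in an IPv4 network of the given prefix length.

    Subtracts the network and broadcast addresses plus `reserved`
    extra addresses (assumed here: one gateway).
    """
    total = 2 ** (32 - prefixlen)
    return total - 2 - reserved


def smallest_prefix_for(nodes, reserved=1):
    """Longest prefix (smallest network) that still fits `nodes` hosts."""
    for prefixlen in range(30, 7, -1):
        if usable_nodes(prefixlen, reserved) >= nodes:
            return prefixlen
    raise ValueError("more nodes than a /8 can hold")


if __name__ == "__main__":
    print(usable_nodes(24))          # nodes that fit on a /24
    print(smallest_prefix_for(300))  # prefix needed for 300 builders
```

Under those assumptions a /24 holds 253 nodes, and 300 builders would need at least a /23.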

Dennis

seth vidal 03-21-2012 01:08 PM

builders of the future!!!!!
 
On Tue, 20 Mar 2012 21:38:13 -0500
Dennis Gilmore <dennis@ausil.us> wrote:


> Today there is not a way to do an anaconda install on any arm system.
> though hopefully we will have that for deployment.

I would hope so. :)

> probably we would be adding 100-300 systems. not only do we need to
> consider overloading of puppet, but also logging and monitoring. I
> guess its more how do we scale our infrastructure from at a guess ~100
> nodes today to 3 to 4 times that

Centrally logging the builders is probably unnecessary. Especially if
we're bouncing them all the time.


> honestly we could do this instead of the monthly updates. just rebuild
> them instead

Sure - but I'm thinking of the emergency "oh look at that nightmare"
updates.


> im ok with that, im pretty sure fas will scale to the extra boxes. do
> we drop monitoring of the builders? what about collectd etc.

Collectd - off. We're not gaining much by having that punish the syslog
server.
We can monitor the builders w/o needing all of the copious info that
collectd provides.

fas I'm not very worried about - though I suspect a couple of things
will change w/how we get the dbs onto the hosts.


> main issue is that today we are not 100% sure of how we will install
> arm boxes. how do we deal with all the non puppet related systems?

I think, if the playbooks are working well, we can use ansible to do
this.

> also need to look into how we can better scale koji itself. when we
> go from 20 to 200+ builders we need to make sure that load doesn't
> cause koji to fall over.

okay - but I think that's more something for the kojidevs than fedora
infra?



> all the arm boxes will have management consoles. but today im not 100%
> sure how access to that would be. we would also need to deploy fedora
> for any arm based systems. things we need to reconsider also is
> networking today the storage network and the builder networks
> are /24's so we could use 253 nodes. i suspect we will go over that
> on the build network. we could not have the storage network on arm
> builders. it is really only needed for createrepo. but we may need to
> look at expanding kojipkgs to more nodes. or increase its network
> throughput with multiple bonded gig network ports. think mass rebuild
> and 100 or 200 buildroots initialising at once. it will stress our
> resources on all levels. but the flexibility of so many nodes could
> allow us to deploy solid solutions to scale and show that fedora is
> still the leader in open infrastructure and sets industry best
> practices.

So one thing I'm not sure I understand - why would we need so many arm
builders? Is it b/c there are so many more arm archs so there will need
to be more pkgs built?


-sv

Kevin Fenzi 03-21-2012 01:33 PM

builders of the future!!!!!
 
On Tue, 20 Mar 2012 21:38:13 -0500
Dennis Gilmore <dennis@ausil.us> wrote:

...snip...

> probably we would be adding 100-300 systems. not only do we need to
> consider overloading of puppet, but also logging and monitoring. I
> guess its more how do we scale our infrastructure from at a guess ~100
> nodes today to 3 to 4 times that

Yeah.

...snip...

> im ok with that, im pretty sure fas will scale to the extra boxes. do
> we drop monitoring of the builders? what about collectd etc.

There's a few things we could do on fas load:

a) add more fas servers.
b) reduce the number of runs. How often do we change someone in
sysadmin-noc, sysadmin-main, sysadmin-build?
c) move to a system where we only re-run fasClient when there is a
change.

I'd agree: collectd off, probably. Or at least a separate one if we
needed to monitor them.

> main issue is that today we are not 100% sure of how we will install
> arm boxes. how do we deal with all the non puppet related systems?
> also need to look into how we can better scale koji itself. when we
> go from 20 to 200+ builders we need to make sure that load doesn't
> cause koji to fall over.

yeah.

> all the arm boxes will have management consoles. but today im not 100%
> sure how access to that would be. we would also need to deploy fedora
> for any arm based systems. things we need to reconsider also is
> networking today the storage network and the builder networks
> are /24's so we could use 253 nodes. i suspect we will go over that
> on the build network. we could not have the storage network on arm
> builders. it is really only needed for createrepo. but we may need to
> look at expanding kojipkgs to more nodes. or increase its network
> throughput with multiple bonded gig network ports. think mass rebuild
> and 100 or 200 buildroots initialising at once. it will stress our
> resources on all levels. but the flexibility of so many nodes could
> allow us to deploy solid solutions to scale and show that fedora is
> still the leader in open infrastructure and sets industry best
> practices.

Yeah, we could hopefully have another network that's larger than /24 for
the arm builders.

I'm sure some of this will be a process of 'oh no, what we have now
doesn't scale, let's fix it'. Of course some of it we can get ready for
up front too.

Overall I like the idea of the automated builder re-install and think
it will get us more ready for things like a large arm cluster.

kevin

seth vidal 03-21-2012 02:03 PM

builders of the future!!!!!
 
On Wed, 21 Mar 2012 08:33:51 -0600
Kevin Fenzi <kevin@scrye.com> wrote:

> There's a few things we could do on fas load:
>
> a) add more fas servers.
> b) reduce the number of runs. How often do we change someone in
> sysadmin-noc, sysadmin-main, sysadmin-build?
> c) move to a system where we only re-run fasClient when there is a
> change.

I'm thinking for the hosts which are sysadmin-ish only - do C.

for the publicish hosts continue to poll fas directly.

so:
- hosted, people, bastion, publictests == poll
- everything else is a set built and pushed to them.


> I'd agree collectd off probibly. Or at least a seperate one if we
> needed to monitor them.

I'm not sure what benefit we get from collectd on transient builders,
though.

On our long-running hosts I understand but not on the builders.



>
> Yeah, we could hopefully have another network thats larger than /24
> for the arm builders.

I can imagine various network changes should easily allow us to
allocate larger than a /24 to the internal build network.


> I'm sure some of this will be a process of 'oh no, what we have now
> doesn't scale, lets fix it'. Of course some of it we can get ready for
> up front too.

yay for planning! :)


> Overall I like the idea of the automated builder re-install and think
> it will get us more ready for things like a large arm cluster.

Then I will get crackin' on making it work.

-sv

Kevin Fenzi 03-21-2012 02:19 PM

builders of the future!!!!!
 
On Wed, 21 Mar 2012 11:03:24 -0400
seth vidal <skvidal@fedoraproject.org> wrote:

> On Wed, 21 Mar 2012 08:33:51 -0600
> Kevin Fenzi <kevin@scrye.com> wrote:
>
> > There's a few things we could do on fas load:
> >
> > a) add more fas servers.
> > b) reduce the number of runs. How often do we change someone in
> > sysadmin-noc, sysadmin-main, sysadmin-build?
> > c) move to a system where we only re-run fasClient when there is a
> > change.
>
> I'm thinking for the hosts which are sysadmin-ish only - do C.
>
> for the publicish hosts continue to poll fas directly.
>
> so:
> - hosted, people, bastion, publictests == poll
> - everything else is a set built and pushed to them.

Yeah, the trick is knowing when there is a change that affects them...

I wonder if we could make fas smarter. Have a serial # for each group.
It pulls and keeps track of that. Then it pulls again but just asks
"what serial # do you have for groups x, y, z". Probably too much added
complexity, I guess.
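
The serial-number idea could look roughly like the Python sketch below. This is purely hypothetical: FAS is not known to expose per-group serials, and `groups_to_refresh` and both dictionaries are invented for illustration. A host would keep the serials it last saw and only re-run fasClient when one of its groups has changed.

```python
def groups_to_refresh(cached_serials, current_serials, watched_groups):
    """Return the watched groups whose serial differs from the cached one.

    cached_serials:  serials the host saw on its last full pull
    current_serials: serials the server reports now (hypothetical API)
    """
    return sorted(
        group
        for group in watched_groups
        if cached_serials.get(group) != current_serials.get(group)
    )


if __name__ == "__main__":
    cached = {"sysadmin-noc": 4, "sysadmin-main": 7}
    current = {"sysadmin-noc": 4, "sysadmin-main": 8, "sysadmin-build": 1}
    watched = ["sysadmin-noc", "sysadmin-main", "sysadmin-build"]
    changed = groups_to_refresh(cached, current, watched)
    if changed:
        print("re-run fasClient for:", changed)
```

An empty result means the host can skip the run entirely, which is where the load saving would come from.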

> > I'd agree collectd off probibly. Or at least a seperate one if we
> > needed to monitor them.
>
> I'm not sure what benefit we get from collectd on transient builders,
> though.
>
> On our long-running hosts I understand but not on the builders.

Yeah, the only case I can see is so we could see how loaded they are...
and we might have better ways to tell that.

> > Yeah, we could hopefully have another network thats larger than /24
> > for the arm builders.
>
> I can imagine various network changes should easily allow us to
> allocate larger than a /24 to the internal build network.

Yeah.

> > I'm sure some of this will be a process of 'oh no, what we have now
> > doesn't scale, lets fix it'. Of course some of it we can get ready
> > for up front too.
>
> yay for planning! :)
>
>
> > Overall I like the idea of the automated builder re-install and
> > think it will get us more ready for things like a large arm
> > cluster.
>
> Then I will get crackin' on making it work.

Sounds good.

kevin

Dennis Gilmore 03-21-2012 03:45 PM

builders of the future!!!!!
 
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 21 Mar 2012 10:08:38 -0400
seth vidal <skvidal@fedoraproject.org> wrote:

> On Tue, 20 Mar 2012 21:38:13 -0500
> Dennis Gilmore <dennis@ausil.us> wrote:
>
>
> > Today there is not a way to do an anaconda install on any arm
> > system. though hopefully we will have that for deployment.
>
> I would hope so. :)
>
> > probably we would be adding 100-300 systems. not only do we need to
> > consider overloading of puppet, but also logging and monitoring. I
> > guess its more how do we scale our infrastructure from at a guess
> > ~100 nodes today to 3 to 4 times that
>
> Centrally logging the builders is probably unnecessary. Especially if
> we're bouncing them all the time.

i think it could be useful for capacity planning and detecting when
things go bad(TM). I wouldn't cry if we do not have it.

>
> > honestly we could do this instead of the monthly updates. just
> > rebuild them instead
>
> Sure - but I'm thinking of the emergency "oh look at that nightmare"
> updates.
>
>
> > im ok with that, im pretty sure fas will scale to the extra boxes.
> > do we drop monitoring of the builders? what about collectd etc.
>
> Collectd - off. We're not gaining much by having that punish the
> syslog server.
> We can monitor the builders w/o needing all of the copious info that
> collectd provides.
>
> fas I'm not very worried about - though I suspect a couple of things
> will change w/how we get the dbs onto the hosts.
>
>
> > main issue is that today we are not 100% sure of how we will install
> > arm boxes. how do we deal with all the non puppet related systems?
>
> I think, if the playbooks are working well, we can use ansible to do
> this.
>
> > also need to look into how we can better scale koji itself. when we
> > go from 20 to 200+ builders we need to make sure that load doesn't
> > cause koji to fall over.
>
> okay - but I think that's more something for the kojidevs than fedora
> infra?

Not really. It's not that koji itself won't scale, but that we will
likely need to look at load balancing again, or look at an internal hub
or two. Each builder checks in every 10 seconds to see if there is
anything to do. All state and everything else is stored in the db, so
adding multiple hubs that read and write to the db is OK. But I want to
make sure that 300 hosts checking in, plus all the public traffic for
koji, get handled gracefully.
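
For a sense of scale, the check-in load is simple arithmetic. The Python sketch below assumes check-ins are spread evenly over the 10-second interval and evenly across hubs, which real builders will not do exactly; the function name is illustrative.

```python
def checkins_per_second(builders, interval_seconds=10.0, hubs=1):
    """Average check-in requests per second that each hub has to answer."""
    return builders / interval_seconds / hubs


if __name__ == "__main__":
    for builders, hubs in [(20, 1), (300, 1), (300, 2)]:
        rate = checkins_per_second(builders, hubs=hubs)
        print(f"{builders} builders, {hubs} hub(s): {rate:.1f} check-ins/s per hub")
```

Going from 20 to 300 builders raises the average from 2 to 30 check-ins per second on a single hub, before counting any public traffic.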

>
> > all the arm boxes will have management consoles. but today im not
> > 100% sure how access to that would be. we would also need to deploy
> > fedora for any arm based systems. things we need to reconsider also
> > is networking today the storage network and the builder networks
> > are /24's so we could use 253 nodes. i suspect we will go over that
> > on the build network. we could not have the storage network on arm
> > builders. it is really only needed for createrepo. but we may need
> > to look at expanding kojipkgs to more nodes. or increase its network
> > throughput with multiple bonded gig network ports. think mass
> > rebuild and 100 or 200 buildroots initialising at once. it will
> > stress our resources on all levels. but the flexibility of so many
> > nodes could allow us to deploy solid solutions to scale and show
> > that fedora is still the leader in open infrastructure and sets
> > industry best practices.
>
> So one thing I'm not sure I understand - why would we need so many arm
> builders? Is it b/c there are so many more arm archs so there will
> need to be more pkgs built?

There are 2 reasons why we will be looking at so many. First, hardware
and software floating point are incompatible, so builders that are
building hardware floating point only build hardware floating point, and
the same for software floating point. Second, while we are looking at
quad core 1.5GHz-2.0GHz builders with 4GB RAM to start with, they are
still not quite as powerful as their x86 counterparts. Since they are
low power, 3-10 watts per node as opposed to 200-300 watts for the
existing builders, I want to err on the side of too many rather than
have too few and have people complain that they have to wait for an arm
builder. Realistically, mass rebuilds are when it will be most
noticeable. At a minimum I want at least double the number of x86 nodes
for each arch, so ~80 total. I do have on my list of things to do coming
up with some reporting from arm koji and primary koji to see what the
average build time is, knowing that what we will deploy will be faster
than what we have today.
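
One way to read the "~80 total" figure is sketched in Python below. It assumes roughly 20 x86 builders (the "20 to 200+" mentioned earlier in the thread) and two arm variants (hardware and software floating point); both inputs are assumptions, not confirmed numbers. The power ranges use the per-node wattages quoted above.

```python
def arm_builder_count(x86_builders=20, arm_variants=2, multiplier=2):
    """Builders needed if each arm variant gets `multiplier` times the x86 count."""
    return x86_builders * multiplier * arm_variants


def power_range_watts(nodes, low_watts, high_watts):
    """Total power draw range for `nodes` machines, as (low, high) watts."""
    return nodes * low_watts, nodes * high_watts


if __name__ == "__main__":
    arm_nodes = arm_builder_count()
    print("arm builders:", arm_nodes)
    print("arm power (W):", power_range_watts(arm_nodes, 3, 10))
    print("x86 power (W):", power_range_watts(20, 200, 300))
```

On those assumptions, 80 arm nodes would draw a few hundred watts in total, against several kilowatts for 20 of the existing builders.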

Dennis
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)

iEYEARECAAYFAk9qBbEACgkQkSxm47BaWfdxZACffjC7ZKxITw xrskW2Zf+vOsa/
OlwAnR3qmy2oOfBm0RAcpcCjNItF5bxV
=lgst
-----END PGP SIGNATURE-----

Jesse Keating 03-21-2012 05:42 PM

builders of the future!!!!!
 
On 3/20/12 7:38 PM, Dennis Gilmore wrote:
> probably we would be adding 100-300 systems. not only do we need to
> consider overloading of puppet, but also logging and monitoring. I
> guess its more how do we scale our infrastructure from at a guess ~100
> nodes today to 3 to 4 times that

Do we know how well kojihub will scale with 300+ builders? I know we've
had issues before where a large number of builders causes some
interesting issues when they are all pinging home to see if there is any
work to be done.


--
Jesse Keating
Fedora -- Freedom² is a feature!

Jesse Keating 03-21-2012 05:48 PM

builders of the future!!!!!
 
On 3/21/12 11:42 AM, Jesse Keating wrote:
> On 3/20/12 7:38 PM, Dennis Gilmore wrote:
> > probably we would be adding 100-300 systems. not only do we need to
> > consider overloading of puppet, but also logging and monitoring. I
> > guess its more how do we scale our infrastructure from at a guess ~100
> > nodes today to 3 to 4 times that
>
> Do we know how well kojihub will scale with 300+ builders? I know we've
> had issues before where a large number of builders causes some
> interesting issues when they are all pinging home to see if there is any
> work to be done.

Disregard, I see this was already discussed.

--
Help me fight child abuse: http://tinyurl.com/jlkcourage

- jlk

Seth Vidal 07-24-2012 06:34 PM

builders of the future!!!!!
 
On Wed, 21 Mar 2012, Kevin Fenzi wrote:

> > > I'd agree: collectd off, probably. Or at least a separate one if we
> > > needed to monitor them.
> >
> > I'm not sure what benefit we get from collectd on transient builders,
> > though.
> >
> > On our long-running hosts I understand but not on the builders.
>
> Yeah, the only case I can see is so we could see how loaded they are...
> and we might have better ways to tell that.
>
> > > Yeah, we could hopefully have another network that's larger than /24
> > > for the arm builders.
> >
> > I can imagine various network changes should easily allow us to
> > allocate larger than a /24 to the internal build network.
>
> Yeah.
>
> > > I'm sure some of this will be a process of 'oh no, what we have now
> > > doesn't scale, let's fix it'. Of course some of it we can get ready
> > > for up front too.
> >
> > yay for planning! :)
> >
> > > Overall I like the idea of the automated builder re-install and
> > > think it will get us more ready for things like a large arm
> > > cluster.
> >
> > Then I will get crackin' on making it work.
>
> Sounds good.


I wanted to come back around to this discussion to close it out, as we
are most of the way done here:


In the last few weeks I've set up a system that deploys a new builder,
provisions it and gets it ready in a single command.


It's in the builder git repository. This repo is on lockbox but it is only
accessible to sysadmin-main and sysadmin-releng.


I've posted a site-specific sanitized version of the script I'm using
here:

http://fedorapeople.org/cgit/skvidal/public_git/scripts.git/tree/ansible/start-prov-boot.py

and I'll be happy to post the playbooks I'm using to provision these
hosts.


The repo is restricted b/c it contains some certs/ssl keys that we aren't
going to give away to everyone :)


The process for reinstalling a host is incredibly trivial; we built all
the hosts for the latest mass rebuild using that process. It takes a
single command and you walk away (other than any enabling of the builder
in koji).

The next step is to put this process into a cron job so we, ideally, can
reinstall a certain percentage of our builders at any/all times.
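
A cron-driven version of this could pick its targets along the lines of the Python sketch below. It is only an illustration and not part of the posted script: `pick_for_reinstall`, the `one_in` parameter and the timestamp dictionary are hypothetical, and the choice to reinstall the oldest installs first is an assumption.

```python
def pick_for_reinstall(last_installed, one_in=10):
    """Pick the builders with the oldest installs, about 1 in `one_in` of them.

    last_installed maps host name -> install time (seconds since the epoch).
    At least one host is returned whenever any hosts exist.
    """
    if not last_installed:
        return []
    # Ceiling division without floats, so 20 hosts give 2 and 21 give 3.
    count = max(1, -(-len(last_installed) // one_in))
    oldest_first = sorted(last_installed, key=lambda h: (last_installed[h], h))
    return oldest_first[:count]


if __name__ == "__main__":
    installs = {f"builder{i:02d}": 1_340_000_000 + i * 86_400 for i in range(20)}
    print(pick_for_reinstall(installs))
```

Each cron run would then hand the returned hosts to the single reinstall command, so every builder is refreshed after about `one_in` runs.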


We're using ansible for all of the command/control and it has been
remarkably stable for our use case. It does require ssh keys on the hosts
but we have that set via kickstarts now for the builders.


After some discussion we took the step of removing FAS and all fedora
accounts from the builders. We couldn't come up with a compelling reason
to keep these throw-away hosts coupled to FAS: since the only folks
connecting to them were sysadmin-main/releng, it was a waste of time to
set up and keep the FAS db on the hosts current. Furthermore, it was an
additional risk that a rogue package could try to snatch up our fas db and
crack the passwords.


If anyone has any questions about how this works or would like any piece
of the infrastructure for doing it (other than the certs/keys :)) please
email to this list and ask.


-sv


